ABSTRACT

Users often join an online social networking (OSN) site, like Facebook, to remain social, by either staying connected
with friends or expanding their social networks. On an OSN site, users generally share a variety of personal information,
which is often expected to be visible to their friends but is sometimes vulnerable to unwarranted access by others.
Recent research suggests that many personal attributes, including religious and political affiliations, sexual
orientation, relationship status, age, and gender, are predictable from users’ personal data on an OSN site. The
majority of users want to remain socially active and protect their personal data at the same time. This tension leads to
a user's vulnerability, allowing privacy attacks that can cause physical and emotional distress to a user, sometimes
with dire consequences. For example, stalkers can use personal information available on an OSN site for their
personal gain. This dissertation aims to systematically study user vulnerability to such privacy attacks.

User vulnerability can be managed in three steps: (1) identifying, (2) measuring, and (3) reducing a user's
vulnerability. Researchers have long been identifying vulnerabilities arising from users' personal data, including user
names, demographic attributes, lists of friends, wall posts and associated interactions, multimedia data such as
photos, audio, and videos, and tagging of friends. Hence, this research first proposes a way to measure and reduce a
user's vulnerability to protect such personal data. This dissertation also proposes an algorithm to minimize a user’s
vulnerability while maximizing their social utility values.

To address these vulnerability concerns, social networking sites like Facebook usually let their users adjust their
profile settings so as to make some of their data invisible. However, users sometimes interact with others using
unprotected posts (e.g., posts from a "Facebook page\footnote{The term "Facebook page" refers to a page that
is commonly dedicated to businesses, brands, and organizations to share their stories and connect with people.}").
Such interactions help users become more social and are publicly accessible to everyone. Thus, the visibility of these
interactions is beyond the control of their profile settings. I explore such unprotected interactions so that users are
well aware of these new vulnerabilities and can adopt measures to mitigate them. In particular, {\em are users'
personal attributes predictable using only the unprotected interactions}? To answer this question, I address the novel
problem of predicting users' personal attributes from unprotected interactions. The extreme sparsity of
users' unprotected interactions poses a serious challenge. Therefore, I mitigate the data sparsity
challenge by designing a novel attribute prediction framework that uses only the unprotected interactions. Experimental
results on a Facebook dataset demonstrate that the proposed framework can predict users' personal attributes.
Chapter 1


INTRODUCTION 


Social networking sites, such as Facebook and Google Plus, have gained popularity
in recent years, becoming an integral part of our lives. They enable users to remain
social by expanding their means of communication: sharing news, expressing sentiments,
exchanging opinions, and making online friends. However, with the presence
of adversaries, the convenience of and low barriers to accessing social networking sites
bring about new challenges.

When a user joins a social networking site to remain social, she also expects
her personal information to be protected from unwarranted access. This tension
leads to a user’s vulnerability, allowing a myriad of privacy attacks. Vulnerability can
cause physical and emotional distress to users, sometimes with dire consequences.
For example, Facebook’s founder has been a victim of stalking and has publicly admitted to
emotional distress¹. In a more serious case of cyberstalking, a perpetrator trolled
women’s Facebook pages searching for clues that allowed him to take over their email
accounts². Furthermore, unwarranted access to personal data on a social networking
site can aid not only the cyberbullying of teenagers but also the cyberstalking
and cyberharassment of adults³.

¹ http://www.tmz.com/2011/02/07/mark-zuckerberg-restraining-order-facebook-social-network-santa-clara-county-stalker-letters-priscilla-chan
² http://usatoday30.usatoday.com/tech/news/2011-07-23-facebook-stalker-sentenced_n.htm
³ en.wikipedia.org/wiki/Cyberbullying

On a social networking site, an individual user can share a large amount of personal
information through channels such as the user’s profile, frequent status updates, and
posts and subsequent interactions. The owner of the site (Facebook, for example)
stores such personal information, whereas some (Facebook) users, including friends,
have direct access to it. From a user’s vulnerability point of view, this gives rise to
two types of concerns. First, the owner of the site sometimes sells personal data
to third parties, including advertising agencies, to generate more revenue.
Second, inadequate privacy mechanisms expose personal data to unwarranted
access by malicious users and applications (apps). The focus of this dissertation
is exclusively on the second concern, whereas previous research on privacy-preserving
data mining [23] proposes different techniques to address the first concern.

To alleviate these vulnerabilities, users are often left with profile settings to mark
their personal data, including demographic profiles, status updates, lists of friends,
videos, photos, and interactions on posts, as invisible to others. Also, the amount of
shared information varies across users. Active users generally share more information,
whereas inactive users share less. Based on profile settings and the
amount of available information, users can be categorized into four types: (1) active
users with public settings, (2) active users with private settings, (3) inactive users with
public settings, and (4) inactive users with private settings. Figure 1.1 shows the
classification of users into four quadrants. Users in quadrant one (Q1 users) generally
provide the maximum amount of personal information, including usernames, demographic
attributes, lists of friends, posts and associated interactions, and tagging of friends.
In comparison with Q1 users, inactive users in quadrant four (Q4 users) provide less
information. Usernames, demographic attributes, and lists of friends are
generally available from Q4 users. Users in quadrant two (Q2 users) try to be
social while marking their profile settings private to secure their personal
data. New mechanisms such as Facebook pages allow Q2 users to interact through
posts without requiring them to be friends, while keeping their personal information,
including demographic profiles, lists of friends, and interactions with friends, private.
Users’ interactions on these pages are often centrally administered and publicly
available to everyone. Based on whether a user can control the visibility of her actions, a
post can be categorized into one of two types: personal or public. A personal post is a post
whose visibility can be controlled by a user’s individual profile settings; otherwise, it is referred
to as a public post. Q2 users often provide usernames, and public posts and associated
interactions. The least amount of information is available for users in quadrant three
(Q3 users). For a Facebook user, the username is the minimum available information.

Figure 1.1: User Types based on Available Information
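
The quadrant assignment described above can be sketched in a few lines of Python. This is only an illustration: the `UserProfile` class and its two boolean flags are hypothetical names, and the text does not specify how "active" or "public" would be decided for a real account. The placement of inactive users follows the prose, which associates Q4 with inactive users whose basic data is still available and Q3 with the least available information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserProfile:
    """Hypothetical summary of a user; both flags are illustrative assumptions."""
    is_active: bool   # shares information frequently
    is_public: bool   # marks the majority of profile settings public


def quadrant(user: UserProfile) -> str:
    """Return the quadrant label (Q1..Q4) used in the text for this user."""
    if user.is_active:
        return "Q1" if user.is_public else "Q2"
    # Inactive users: Q4 still exposes basic data, Q3 exposes the least.
    return "Q4" if user.is_public else "Q3"


print(quadrant(UserProfile(is_active=True, is_public=False)))
```

Under these assumptions, the two quadrants studied in this dissertation are exactly those returned for `is_active=True`.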

Based on the literature survey, there are three steps to manage one’s vulnerability:
(1) identifying, (2) measuring, and (3) reducing a user’s vulnerability. Hence, for each
of the four types of user, vulnerability can be managed by addressing these three
steps separately. Since active users generously provide information on an OSN site,
this dissertation mainly focuses on active users (i.e., Q1 and Q2 users). Researchers
have long been identifying the privacy attacks associated with Q1 users, whereas
Q2 users are mostly unexplored. Hence, this dissertation focuses on measuring and
reducing Q1 users’ vulnerabilities. Furthermore, this dissertation also predicts Q2
users’ personal attributes using their interactions on public posts. To the best of
my knowledge, this is the first attempt to identify Q2 users’ vulnerabilities.

The remainder of this dissertation is organized as follows. Chapter 2 provides a
novel way to measure the vulnerabilities of Q1 users and proposes a mechanism to reduce
them. Chapter 3 identifies Q2 users’ vulnerabilities using interactions
on public posts. Experiments in both chapters are performed on large-scale
Facebook datasets. A brief literature review is provided in Chapter 4. Finally,
conclusions and future work are outlined in Chapter 5.
Chapter 2


MANAGING VULNERABILITY OF ACTIVE USERS WITH PUBLIC SETTINGS


In this chapter, my focus is on active users who mark the majority of their profile
settings public (i.e., Q1 users). Such users on a social networking site can choose
to reveal their personal information through usernames, some demographic attributes,
wall posts and their associated interactions (e.g., likes, comments, replies, and shares),
lists of friends, and tagging of friends. Researchers have shown that users’ personal
information can be used to predict personal attributes and traits, including
gender, age, location, religious and political affiliations, relationship status, sexual
orientation, ethnicity, educational level, social ties, parental separation, openness,
conscientiousness, extraversion, agreeableness, and neuroticism. Chapter 4 reviews
the literature on identifying a user’s vulnerability arising from publicly available
information. As different users are exposed to such privacy attacks differently, this chapter
focuses on (1) measuring a user’s vulnerability and (2) providing a way to mitigate it. For
the rest of this chapter, active users with public profile settings are simply referred
to as users, unless otherwise stated.

In this chapter, I show that it is feasible to measure a user’s vulnerability based on
three factors: (1) a user’s privacy settings can reveal personal information; (2) a user’s
actions on a social networking site can expose their friends’ personal information; and
(3) friends’ actions on a social networking site can reveal the user’s personal information.
Based on these three factors, I later show how a user’s vulnerability can be
measured and assessed. This vulnerability measure enables me to quantify users’
vulnerability and identify their vulnerable friends. As I draw parallels between users
and their friends, I am interested in finding an effective mechanism that could make
users less vulnerable. Unfriending vulnerable friends reduces users’ vulnerability.
This mechanism has been validated using extensive experiments on users and their
friends on Facebook.

Figure 2.1: Which One Vulnerable Friend to Unfriend from (A,B,C) for User U1

Sociologists, psychologists, and economists [58, 76, 7, 8] have been researching
the impact of social interactions on the social utility value of a user and of society.
Although unfriending vulnerable friends can reduce vulnerability, this simple strategy
can limit social interaction opportunities among users. Besides limiting interaction
opportunities, unfriending socially important or valuable friends can backfire and
reduce one’s social status as well. Social importance can be measured in terms of social
utility [48]. One such utility is the nodal degree of a friend. Refer to Figure 2.1: if A
is the most vulnerable but also the most popular among U’s friends, could U unfriend his
other vulnerable friends instead of A in order to reduce vulnerability? The
additional challenge, then, is how to maintain low vulnerability and high social utility
for a social networking user. The work in this chapter addresses this challenge by
developing novel solutions to the following questions without suggesting any structural
change to a social networking site.

1. How can we measure and assess a user’s vulnerability? Is there an effective
mechanism to make users less vulnerable?

2. What is the social cost of a vulnerability reduction mechanism on a user’s social
utility? How can we achieve a balance between a user’s vulnerability and social
utility?

The rest of the chapter is organized as follows. Section 2.1 studies the collectible
statistics from a social networking site, presents a quantifiable measure to evaluate
users’ vulnerability, and defines the problem of identifying vulnerable friends [32].
Section 2.2 proposes a methodology and measures for evaluating whether or not a user is
vulnerable and how to adjust a user’s network to best deal with threats presented by
vulnerable friends [32]. Section 2.3 presents a constrained vulnerability minimization
problem. To this end, I formulate two novel optimization problems of vulnerability
minimization. I also discuss the hardness of the problem and provide approximation
guarantees for efficient algorithms [33]. Section 2.4 conducts an empirical study to
evaluate methods that can be used to make users less vulnerable, compare
the performance of an optimal algorithm with that of intuitive heuristic methods, and
discuss the approach that can be used to assess the impact of new friends on a user’s
network. I also evaluate methods that make users less vulnerable while retaining an
acceptable level of social utility values of vulnerable friends. Finally, I summarize the
chapter with possible future research directions in Section 2.5.
2.1 Measuring User Vulnerability


Every user on a social networking site can choose to reveal their personal information
using a range of attributes. Figure 2.1 shows an illustrative example where a
user U has eight friends (A, B, C, D, E, F, G, H). Based on their preferences, friends
are assumed to reveal different attributes from the available list of personal
attributes. In this section, I propose a measure to quantify user U’s vulnerability. I
first divide the user’s personal attributes into two sets, individual and community
attributes. Individual attributes (I-attributes) characterize individual user information,
including personal information such as gender, birth date, phone number, home
address, group memberships, etc. Community attributes (C-attributes) characterize
information about friends of a user, including friends that are traceable from a user’s
profile (i.e., the user’s friend list), tagged pictures, wall interactions, etc. These attributes
are always accessible to friends but may not be accessible to other users. A user’s vulnerability
depends on the visibility and exposure of the user’s profile through not only attribute
settings but also his friends.

Oftentimes, users on a social networking site are unaware that they could pose a
threat to their friends due to their vulnerability. In this chapter, I show that it is
feasible to measure a user’s vulnerability based on three factors: (1) the user’s privacy
settings, which can reveal personal information; (2) the user’s actions on a social networking
site, which can expose their friends’ personal information; and (3) friends’ actions on a
social networking site, which can reveal the user’s personal information. Based on these
factors, I formally present one of the earliest models for vulnerability reduction.

Definition 1 (I-index) I-index estimates how much risk to privacy a user can incur
by allowing individual attributes to be accessible or visible to other users. A user who
ignores or is unaware of privacy settings is a threat to himself. I-index is defined as
a function of individual attributes (I-attributes). I-index of user u is given by

\[ I_u = f(A_u), \tag{2.1} \]

where $I_u \in [0,1]$, $A_u = \{a_{u,i} \mid \forall i,\ a_{u,i} \in \{0,1\}\}$ is the I-attribute set for user $u$, and $a_{u,i}$
is the status of the $i$-th I-attribute for user $u$. $a_{u,i} = 1$ indicates that user $u$ has enabled the
$i$-th I-attribute to be visible to everyone; otherwise, the attribute is non-visible (and may be sensitive for the
user). Note that a user attribute visible only to friends is marked as disabled.


Table 2.2 shows statistics of commonly found I-attributes on Facebook (refer to
Section 2.4.1 for details on the Facebook dataset, which consists of 2,056,646 users). The last
column in the table lists the percentage of people who enable the particular attribute
to be visible. For example, 7,430 (0.36%) Facebook users enabled their mobile phone
numbers to be visible. I define the sensitivity (weight) of an attribute as its percentage
of non-visibility. Hence, the sensitivity of a mobile phone number according to the
Facebook dataset is 100 - 0.36 = 99.64. This means that users do not usually disclose
their mobile phone number to other users. Users who do disclose phone numbers
have a propensity to vulnerability because they disclose more sensitive information
in their profiles.

I used a normalized weighted average to estimate the I-index. The I-index for each profile
user u is given by

\[ I_u = f(A_u) = \frac{\sum_{i=1}^{n} w_i \, a_{u,i}}{\sum_{i=1}^{n} w_i}, \tag{2.2} \]

where $w_i$ is the sensitivity (weight) of the $i$-th I-attribute, $n$ is the total number of
I-attributes available via a social networking site profile, and $a_{u,i} = 1$ if the $i$-th I-attribute
is visible; otherwise, $a_{u,i} = 0$ and the attribute is not visible (i.e., sensitive to user $u$). $I_u \in [0,1]$.
$I_u = 1$ indicates that user $u$ has marked all I-attributes to be visible. On the other hand,
$I_u = 0$ indicates that user $u$ has marked all I-attributes to be non-visible.
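
As a minimal sketch of Equation (2.2), the following Python snippet derives sensitivity weights from visibility percentages and computes an I-index. The names `VISIBLE_PCT`, `WEIGHTS`, and `i_index` are illustrative and do not come from the original work. Only five attributes from the reported Facebook statistics are used here, whereas the dissertation's sum ranges over all I-attributes, so the absolute values printed by this sketch differ from those of the full model.

```python
from __future__ import annotations

# Percentage of users who make each attribute visible (a five-attribute
# subset of the Facebook statistics reported in this chapter).
VISIBLE_PCT = {
    "gender": 81.77,
    "relationship_status": 26.24,
    "religious_views": 1.61,
    "political_views": 1.19,
    "mobile_number": 0.36,
}

# Sensitivity (weight) of an attribute = percentage of non-visibility.
WEIGHTS = {name: 100.0 - pct for name, pct in VISIBLE_PCT.items()}


def i_index(visible: set[str], weights: dict[str, float] = WEIGHTS) -> float:
    """Normalized weighted average of the visible I-attributes (Equation 2.2)."""
    total = sum(weights.values())
    exposed = sum(w for name, w in weights.items() if name in visible)
    return exposed / total


# Two hypothetical profiles, each revealing two attributes.
print(i_index({"religious_views", "political_views"}))
print(i_index({"gender", "relationship_status"}))
```

Because rarely disclosed attributes carry larger weights, the first profile receives the higher I-index even though both profiles reveal the same number of attributes.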
Attributes      Users #     Per (%)   Attributes            Users #     Per (%)
I-attributes:
Children        86,609      4.21      Current City          620,401     30.17
Networks        284,482     13.83     Hometown              727,674     35.38
Parents         73,887      3.49      Gender                1,681,673   81.77
Bio             199,070     9.68      Birthday              67,834      3.30
Interested in   383,724     18.66     Relationship status   539,612     26.24
Looking for     449,498     21.86     Siblings              244,658     11.90
Music           941,340     45.77     Education and work    516,848     25.13
Books           281,346     13.68     Like and interests    1,369,080   66.57
Movies          574,243     27.92     Email                 27,103      1.32
Television      684,843     33.30     Mobile number         7,430       0.36
Activities      385,417     18.74     Website               128,776     6.26
Interests       308,229     14.99     Home address          7,580       0.37
C-attributes:                         Political Views       24,438      1.19
Friends trace   1,481,472   72.03     Religious Views       33,036      1.61
Total users     2,056,646

Figure 2.2: Attributes Statistics of Facebook Dataset.


From the viewpoint of optimization, it is common to use a linear sum as an objective
function or constraint to reduce the overall complexity of finding the optimal
solution (e.g., linear programming). Inspired by this, the chapter proposes a linear
sum (weighted average) function of individual attributes and their sensitivity weights
(percentage of non-visibility) to compute the I-index of a user (Equation (2.2)). The
proposed linear sum function is simple and captures the intuition that the vulnerability
of a user increases as more attributes, particularly sensitive ones, become visible. For example, Table 2.2
shows that a profile revealing “religious (1.61%)” and “political (1.19%)” affiliation
values should be more vulnerable than a profile revealing “Gender
(81.77%)” and “Relationship status (26.24%)” values.

Definition 2 (C-index) C-index estimates how much threat a user can pose to their
friends by making community attributes accessible or visible to other users. Users who
ignore or are unaware of privacy settings of community attributes can create risk
to the entire community of friends. C-index is defined as a function of community
attributes (C-attributes). C-index for a user u is given by

\[ C_u = g(B_u), \tag{2.3} \]

where $C_u \in [0,1]$, $B_u = \{b_{u,i} \mid \forall i,\ b_{u,i} \in \mathbb{Z}^+\}$ is the C-attribute set for user $u$, $b_{u,i}$ indicates
the number of friends affected when the corresponding C-attribute is manifested, and
$\mathbb{Z}^+$ is the set of positive integers. I ignore attributes marked as non-visible. The
Facebook dataset has only one C-attribute (see Table 2.2), which indicates how many
friends are traceable (via a friend relationship) from an individual user. 1,481,472
(72.03%) Facebook users in the dataset allowed friends to be traced by other users. Thus,
a large portion of users are either not careful about or not aware of the privacy concerns of
their friends.

A vulnerable user, u, can pose a threat to his friends. The amount of the threat
increases with the number of friends that are put at risk. However, the rate of the
increment decreases as more friends are put at risk. To appropriately represent this
threat change, I choose a concave, non-decreasing log function to estimate the threat
for each user based on the number of friends placed at risk by each C-attribute. Hence, the
C-index for a user u is calculated as

\[ C_u = g(B_u) = \frac{\sum_{i=1}^{m} \log(b_{u,i})}{4m}, \tag{2.4} \]

where $m$ is the total number of C-attributes possible on a social networking site, and the
constant 4 is chosen because $C_u \in [0,1]$ and none of the Facebook users in the dataset
has more than $10^4$ friends (so the base-10 logarithm of $b_{u,i}$ is at most 4). $C_u > 0$ indicates that user $u$ has allowed everyone to trace
friends through their own profile. On the other hand, $C_u = 0$ indicates that all the
friends (except one) of user $u$ are non-traceable through a profile.

(a) I-index (Red) and C-index (Blue) (b) P-index (Red) and V-index (Blue)

Figure 2.3: Relationship Among Index Values for Each User.
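
Equation (2.4) can be sketched in Python as follows. The function name `c_index` and its parameters are hypothetical. The sketch assumes a base-10 logarithm, which is what the constant 4 and the bound of 10^4 friends suggest, and it takes only the visible C-attributes as input, since non-visible ones are ignored.

```python
from __future__ import annotations

import math


def c_index(friends_affected: list[int], m: int, max_log: float = 4.0) -> float:
    """C-index of one user, following Equation (2.4).

    friends_affected: b_{u,i} for each *visible* C-attribute (positive integers).
    m: total number of C-attributes the site offers (m = 1 for the Facebook data).
    max_log: normalizing constant; 4 because no user has more than 10**4 friends.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if any(b < 1 for b in friends_affected):
        raise ValueError("each b_{u,i} must be a positive integer")
    return sum(math.log10(b) for b in friends_affected) / (max_log * m)


# One C-attribute (traceable friends) with increasing numbers of friends.
for n_friends in (1, 10, 100, 1000, 10_000):
    print(n_friends, c_index([n_friends], m=1))
```

Each tenfold increase in traceable friends adds the same amount to the index, so the contribution of every additional friend shrinks as the friend list grows.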

Fig. 2.3a shows the I-index and C-index for 100K randomly chosen Facebook users.
Note that users are sorted in ascending order of their I-indexes, which gives a curve-like
impression when plotting the I-index. The X-axis and Y-axis indicate users and their
corresponding I-index and C-index values, respectively. Fig. 2.3a demonstrates that for the
majority of users, the C-index value is greater than the corresponding I-index value.
This highlights one finding of this chapter: a large portion of users are either
not careful about or not aware of the privacy concerns of their friends.


Definition 3 (P-index) P-index estimates how public (visible) or private (non-visible) a
user is on a social networking site. It shows how much an individual user aims to
protect himself as well as his friends. P-index is defined as a function of I-index and
C-index. P-index of user u is given by

\[ P_u = j(I_u, C_u), \tag{2.5} \]

where $P_u \in [0,1]$.

I choose a simple, weighted average function to calculate the P-index for each Facebook
user in the dataset,

\[ P_u = \alpha I_u + (1-\alpha) C_u = \alpha f(A_u) + (1-\alpha) g(B_u), \tag{2.6} \]

where $\alpha \in [0,1]$. Substituting Eq. (2.2) and Eq. (2.4) into Eq. (2.6), I get

\[ P_u = \alpha \frac{\sum_{i=1}^{n} w_i \, a_{u,i}}{\sum_{i=1}^{n} w_i} + (1-\alpha) \frac{\sum_{i=1}^{m} \log(b_{u,i})}{4m}. \tag{2.7} \]

Different users may have different priorities about friends and may have different
perspectives about vulnerability. The tunable parameter $\alpha$ can be set to address the
needs of different users. For example, one may choose $\alpha < 0.5$ to deemphasize the
individual attributes’ visibility, or $\alpha > 0.5$ to emphasize it.
For the experiments, I set $\alpha = 0.5$ to put equal weights on individual
and community attributes.
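
The effect of the tunable weight in Equation (2.6) can be illustrated with a short Python sketch. The function name `p_index` and the parameter name `alpha` (standing for the tunable weight) are illustrative, and the index values passed in are invented for the example rather than taken from the dataset.

```python
def p_index(i_idx: float, c_idx: float, alpha: float = 0.5) -> float:
    """Weighted average of I-index and C-index (Equation 2.6)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * i_idx + (1.0 - alpha) * c_idx


# Hypothetical user with a low I-index (0.2) and a high C-index (0.8).
for alpha in (0.25, 0.5, 0.75):
    print(alpha, round(p_index(0.2, 0.8, alpha), 3))
```

For such a user, lowering the weight below 0.5 moves the P-index toward the C-index, and raising it moves the P-index toward the I-index, which matches the description of the parameter above.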

The P-index addresses the first two factors of the user vulnerability estimation
discussed above, i.e., the user’s privacy settings (I-index) and actions that expose friends
(C-index). Thus, I follow a commonly used optimization formulation, as in linear
programming, as I did for the I-index. Mathematically, any function that combines the
I-index and C-index and ranges over [0,1] can be used. In this chapter, the above
simple weighted average function works well, and it is also easy to interpret when
applying these indices for various needs.


Definition 4 (V-index) V-index estimates how vulnerable a user is on a social networking site. Thus far, I have provided three indexes, I-index, C-index, and P-index, for a user based on the visibility of I-attributes and C-attributes. The vulnerability of a user depends on the privacy settings of the user, the user's friends, their friends, and so on. Intuitively, the marginal risk to a user decreases as the distance from a vulnerable user on the social networking site increases. Hence, I consider only a user and the user's friends in estimating the vulnerability of a user. The V-index of a user depends on the P-indexes of the user and the user's friends. The V-index of user u is defined as

$$V_u = h(P_{F_u \cup \{u\}}), \qquad (2.8)$$

where $F_u$ is the set of friends of user u, $P_{F_u \cup \{u\}} = \{P_i \mid i \in F_u \cup \{u\}\}$, and $V_u \in [0,1]$. I rewrite the above notation without loss of generality as

$$V_u = h(F_u \cup \{u\}). \qquad (2.9)$$

Figure 2.3b shows the P-index and V-index for 100K randomly chosen users (the same users chosen for Figure 2.3a). Note that users are sorted in ascending order of their P-indexes, which gives the P-index plot its curve-like appearance. The X-axis and Y-axis indicate users and their index values, respectively. A simple weighted average function is used to plot the V-index for each user,

$$V_u = \frac{P_u + \sum_{i \in F_u} P_i}{|F_u| + 1}. \qquad (2.10)$$


The design of the V-index function has a direct impact on the complexity of solving the problem of identifying k vulnerable friends (described in detail next). I later show that this problem can be solved in polynomial time if the V-index is computed as shown in Equation (2.10).
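Equation (2.10) can be sketched in a few lines, assuming the P-indexes of the user and of each friend are already available. The function name `v_index` and the sample values are illustrative only.

```python
from typing import List


def v_index(p_user: float, p_friends: List[float]) -> float:
    """Average of the user's P-index and the friends' P-indexes, Eq (2.10)."""
    return (p_user + sum(p_friends)) / (len(p_friends) + 1)


# Hypothetical user with P-index 0.5 and three friends.
example = v_index(0.5, [0.9, 0.7, 0.3])   # (0.5 + 1.9) / 4
```

A user with no friends gets a V-index equal to the user's own P-index, since the denominator reduces to 1.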


2.2  Reducing  User  Vulnerability 


Next, I provide a mechanism based on unfriending to reduce user vulnerability.

Definition 5 (A vulnerable friend) A user's vulnerable friend is defined as a friend whose unfriending will lower the V-index score of the user. The V-index of a user u upon removing the vulnerable friend v is given by

$$V'_u = h(F_u \cup \{u\} \setminus \{v\}). \qquad (2.11)$$

By the definition of a vulnerable friend, $V'_u < V_u$.

The definition of a vulnerable friend can be generalized to k-vulnerable friends.

Definition 6 (k-vulnerable friends) k-vulnerable friends of a user are k friends whose unfriending will lower the V-index score of the user. The V-index of user u upon removing k vulnerable friends $\{v_1, \cdots, v_k\}$ is given by

$$V'_u = h(F_u \cup \{u\} \setminus \{v_1, \cdots, v_k\}). \qquad (2.12)$$

By the definition of k-vulnerable friends, $V'_u < V_u$.

Based on Definitions 5 and 6, a user's friends can be divided into two sets: (1) an initial set of vulnerable friends $D_{u,0}$, who are responsible for increasing the V-index value of user u, and (2) a set of non-vulnerable friends $F_u \setminus D_{u,0}$, who are responsible for decreasing the V-index value of user u. Hence, Eq (2.9) can be rewritten as

$$V_u = h(\{F_u \setminus D_{u,0}\} \cup \{u\} \cup D_{u,0}). \qquad (2.13)$$

In order to minimize the user vulnerability¹ function $h(\cdot)$, I have to unfriend vulnerable friends from $D_{u,0}$. The user vulnerability minimization problem seeks, for a parameter k from user u, to find a new set of k vulnerable friends $D_u \subseteq D_{u,0}$ to unfriend. The minimum new V-index for user u is achieved after unfriending the selected vulnerable friends $D_u$, where $|D_u| \le k$. The new V-index of user u, upon removing the selected set of vulnerable friends $D_u \subseteq D_{u,0}$, is given by

$$V'_u = h(\{F_u \setminus D_{u,0}\} \cup \{u\} \cup \{D_{u,0} \setminus D_u\}). \qquad (2.14)$$

By the definition of k-vulnerable friends, $V'_u < V_u$.

¹ A user's vulnerability can also be minimized by (1) disabling visibility of the user's sensitive attributes (such as phone number, email address, home address, etc.) and not exposing friends to others; and by (2) requesting (or negotiating with) the vulnerable friends to lower their vulnerability index.

This problem of minimizing the vulnerability of user u is equivalently stated as finding the set of at most k vulnerable friends $D_u \subseteq D_{u,0}$ to unfriend, who are responsible for maximizing the vulnerability of user u.

Let $\sigma(D_u)$ be the estimate of how vulnerable user u is due to vulnerable friends $D_u$. Thus, I maximize the function $\sigma : 2^{|D_{u,0}|} \to \mathbb{R}^*$, where $\mathbb{R}^*$ is the set of non-negative real numbers. Note that $h(D_u)$ is not the same as $\sigma(D_u)$, even though function $\sigma(\cdot)$ depends on function $h(\cdot)$. For example, if $h(\cdot)$ is a simple average function of the P-indexes of the user and friends, then function $\sigma(\cdot)$ estimates the total P-index value of all the vulnerable friends. In other words, function $\sigma(D_u)$ estimates the vulnerability induced by the vulnerable friends $D_u$ on user u. The Vulnerability Maximization with Cardinality constraint problem (VMC) is formulated as follows.

VMC($D_{u,0}$, P, k, $V_u$) - Instance: The finite set of initial vulnerable friends $D_{u,0} \subseteq F_u$ of user u, P-index $P_i \in [0,1]$ $\forall i \in D_{u,0}$, a vector $P = (P_1, \cdots, P_{|D_{u,0}|})$, and constants $k \in \mathbb{Z}^*$ and $V_u \in \mathbb{R}^*$. - Question: Is there a subset $D_u \subseteq D_{u,0}$ such that $|D_u| \le k$ and $\sigma(D_u) \ge V_u$?

The above problem can be solved in polynomial time if function $\sigma(\cdot)$ is a linear function of the P-index values of vulnerable friends. The linear function $\sigma(\cdot)$ can be represented as

$$\sigma(D_{u,0}) = \sum_{i=1}^{|D_{u,0}|} \lambda_i * P_i, \quad \forall i \in D_{u,0},\ \lambda_i \in \mathbb{R}^* \qquad (2.15)$$

Since vulnerable friends make a user profile less secure, $\lambda_i$ cannot be a negative real number. For Eq (2.15), the most vulnerable friend $d \in D_{u,0}$ is given by

$$d = \max_{v \in D_{u,0}} \sigma(D_{u,0}) = \max_{i} \{\lambda_i * P_i \mid \forall i \in D_{u,0}\} \qquad (2.16)$$

If I repeatedly identify the most vulnerable friend d using Eq (2.16), k times (or n times, if n < k, with $n = |D_{u,0}|$), and remove d from $D_{u,0}$ after every run, I obtain the k-vulnerable friends that maximally reduce user vulnerability.
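The repeated selection just described can be sketched as follows, assuming a linear $\sigma(\cdot)$ as in Eq (2.15). The function name `k_most_vulnerable`, the dictionary inputs, and the default weight of 1.0 for a missing $\lambda_i$ are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def k_most_vulnerable(p_index: Dict[str, float], k: int,
                      weights: Optional[Dict[str, float]] = None) -> List[str]:
    """Pick up to k friends by repeatedly taking the largest lambda_i * P_i."""
    lam = weights or {}
    remaining = dict(p_index)          # copy, so the caller's data is untouched
    chosen: List[str] = []
    for _ in range(min(k, len(remaining))):
        # Eq (2.16): the most vulnerable friend among those still in the set.
        d = max(remaining, key=lambda f: lam.get(f, 1.0) * remaining[f])
        chosen.append(d)
        del remaining[d]               # remove d before the next run
    return chosen


# Hypothetical initial set of vulnerable friends and their P-indexes.
top_two = k_most_vulnerable({"A": 0.9, "B": 0.7, "C": 0.5}, k=2)
```

With non-negative weights, this is equivalent to sorting the friends by $\lambda_i * P_i$ and taking the first k, which is why the linear case runs in polynomial time.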

In general, I do not assume that $\sigma(\cdot)$ is linear. Later, I will discuss how to solve the VMC problem when $\sigma(\cdot)$ is non-linear.

So  far  I  aim  to  reduce  the  vulnerability  of  a  user  without  considering  its  social 
impact.  If  the  vulnerable  user  selected  to  unfriend  is  also  socially  valuable,  then  it 
could  lead  to  a  serious  social  problem.  For  example,  a  user  may  not  want  to  unfriend 
his  girlfriend,  though  vulnerable.  Next  I  investigate  this  problem  of  user  vulnerability 
reduction  with  social  utility  constraints. 

2.3 Reducing User Vulnerability with Social Utility

An essential function of social networking sites is to help users to be social. Although unfriending vulnerable friends from a user's social network can sometimes reduce vulnerability, this strategy could also significantly limit social interaction among users.

The social utility can be defined by different measures from social network analysis. In this work, I use three basic measures: nodal degree (or simply degree), tie strength, and number of common friends. A friend with a high degree is a popular one; when two friends have a strong tie [25] or a large number of common friends, they are very close friends. In other words, I can employ these measures to help determine the consequence of unfriending on a user. Unfriending a vulnerable friend with a high degree could limit the user's potential to make more friends, which makes the user more reclusive and defeats the purpose of social networking. Generally, family members, best friends, and a girlfriend or boyfriend are examples of strong ties. Though unfriending them could reduce one's vulnerability, it would not be desirable. When the user and a friend share a large number of friends, removing the friend could affect the structural balance [20] of the social network and the user's clustering coefficient [78]. Thus, it is essential to consider the social utility of vulnerable friends before unfriending them in order to reduce vulnerability.

Vulnerable friends   P-index   Degree   Tie Strength   #Common Friends
A                    0.9       100      0.8            30
B                    0.7       200      0.5            50
C                    0.5       300      0.3            10

Table 2.1: Unfriending One Vulnerable Friend from (A, B, C) to Minimize User U's Vulnerability while Retaining Socially Valuable Friends. A High P-index Suggests High Vulnerability. Social Utility can be Materialized in Various Forms. In This Work, I Use Three Common Measures: One's Nodal Degree, Tie Strength, and Number of Common Friends.

There exist many social utility measures that can be categorized into two types [4]: (1) social utility measures related to the vulnerable friends, and (2) social utility measures related to the relationship between the user and each of the user's vulnerable friends. Table 2.2 lists some social utility measures available for a user u to consider while selecting the vulnerable friends $D_u \subseteq D_{u,0}$ to unfriend. Figure 2.1 shows an illustrative example where a user U has three vulnerable friends (A, B, C). Table 2.1 lists the friends' P-indexes (indicating user visibility) and their social utility measures, including degree, tie strength, and number of common friends. If I do not consider the social utility measures of vulnerable friends, the most vulnerable friend A should be removed to minimize user U's vulnerability. If I consider tie strength, user A is the most valuable for user U. Similarly, friends B and C are also valuable because B shares the largest number of common friends and C has the highest degree. Under these different circumstances, I now study which friend U should unfriend in order to reduce vulnerability with the constraint of retaining social utility.
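The tension in the Table 2.1 example can be made concrete with a small sketch. The values are copied from Table 2.1; the rule used here, which protects the single most valuable friend under one chosen measure and then removes the remaining friend with the highest P-index, is a hypothetical simplification for illustration and not the method developed in this chapter.

```python
from typing import Optional

# Values copied from Table 2.1; the dictionary layout is illustrative.
FRIENDS = {
    "A": {"p_index": 0.9, "degree": 100, "tie_strength": 0.8, "common_friends": 30},
    "B": {"p_index": 0.7, "degree": 200, "tie_strength": 0.5, "common_friends": 50},
    "C": {"p_index": 0.5, "degree": 300, "tie_strength": 0.3, "common_friends": 10},
}


def most_valuable(measure: str) -> str:
    """Friend with the largest value of the given social utility measure."""
    return max(FRIENDS, key=lambda f: FRIENDS[f][measure])


def unfriend_candidate(protect_measure: Optional[str] = None) -> str:
    """Most vulnerable friend, skipping the one protected by the measure."""
    protected = {most_valuable(protect_measure)} if protect_measure else set()
    candidates = [f for f in FRIENDS if f not in protected]
    return max(candidates, key=lambda f: FRIENDS[f]["p_index"])


no_utility = unfriend_candidate()                    # ignores social utility
with_ties = unfriend_candidate("tie_strength")       # keeps the strongest tie
```

Without any utility measure the sketch selects A; once tie strength is protected, A is kept and B becomes the candidate, which mirrors the conflict described above.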
Social utility measures related to the initial set of vulnerable friends $D_{u,0}$ of a user u:

•  Degree (number of friends) of a vulnerable friend.
•  Local clustering coefficient (density) of a vulnerable friend.
•  Number of closed triads of a vulnerable friend.
•  Number of open triads of a vulnerable friend.
•  Number of posts (social activity) by a vulnerable friend.
•  Number of responses (social activity) by a vulnerable friend.
•  Popularity (may be based on replies on posts) of a vulnerable friend.
•  Influence index of a vulnerable friend.

Social utility measures related to the relationship between user u and each vulnerable friend in $D_{u,0}$:

•  Number of common friends between user and vulnerable friend.
•  Tie strength between user and vulnerable friend.
•  Number of uncommon friends between user and vulnerable friend.
•  Number of responses (social activity) by a user u on a vulnerable friend's post.
•  Number of responses (social activity) by a vulnerable friend on a user's post.
•  Homophily (similarity) index between user and a vulnerable friend.

Table 2.2: Types of Social Utility Measures


Next  I  define  the  optimization  problem  for  reducing  a  user  vulnerability  while 
retaining  socially  valuable  friends. 

Definition 7 (Social Utility Loss) It estimates how much a user loses on a social networking site after unfriending friends. Social utility loss of user u depends on the social utility values of u's friends. For a given social utility measure, the social utility loss for a user u after unfriending a given set of friends $A \subseteq F_u$ is given by

$$L_u = \mathcal{L}(S_A), \qquad (2.17)$$

where $S_A = \{S_i \mid i \in A \subseteq F_u, S_i \in \mathbb{R}^*\}$, $S_i$ represents the social utility measure of user i, and $L_u \in \mathbb{R}^*$.

Assuming $\mathcal{L}(\cdot)$ to be a simple additive function, I have

$$L_u = \sum_{i \in A \subseteq F_u} S_i. \qquad (2.18)$$

Similar to the VMC problem formulation, the problem of minimizing the vulnerability of user u with a social utility constraint can be stated as finding the set of at most k vulnerable friends $D_u \subseteq D_{u,0}$ to unfriend, who are responsible for maximizing the user vulnerability, while minimizing user u's social utility loss. I can formally state the problem of Vulnerability Maximization with minimum social Utility loss and Cardinality constraint (VMUC) as

VMUC($D_{u,0}$, P, S, k, $V_u$, $C_u$) - Instance: Given a finite set of initial vulnerable friends $D_{u,0} \subseteq F_u$ of a user u, P-index $P_i \in [0,1]$ $\forall i \in D_{u,0}$, a vector $P = (P_1, \cdots, P_{|D_{u,0}|})$, social utility measure $S_i \in \mathbb{R}^*$ $\forall i \in D_{u,0}$, a vector $S = (S_1, \cdots, S_{|D_{u,0}|})$, and constants $k \in \mathbb{Z}^*$, $V_u \in \mathbb{R}^*$, and $C_u \in \mathbb{R}^*$. - Question: Find a subset $D_u \subseteq D_{u,0}$ such that $|D_u| \le k$, $\sigma(D_u) \ge V_u$, and $\sum_{i \in D_u} S_i \le C_u$.

I now focus on solving the VMUC($D_{u,0}$, P, S, k, $V_u$, $C_u$) problem. First, I prove the function $\sigma(\cdot)$ is non-negative, non-decreasing and submodular.

Theorem 1 (Monotonicity) The function $\sigma : 2^{|D_{u,0}|} \to \mathbb{R}^*$ is monotonically non-decreasing, i.e., $\sigma(A) \le \sigma(A \cup \{v\})$, where $A \subseteq D_{u,0}$ and $v \in D_{u,0}$.

Proof. As discussed above, for a user u, each friend can be classified as a vulnerable or a non-vulnerable friend. From Definition 5, the V-index of a user u decreases upon removing the vulnerable friend v. Therefore, $h(S \cup A) \le h(S \cup A \cup \{v\})$, where S and A are sets of non-vulnerable (including user u) and vulnerable friends, respectively. This means that for user u, the vulnerability of set $A \cup \{v\}$ is more than that of set A. Hence, $\sigma(A) \le \sigma(A \cup \{v\})$. □

Theorem 2 (Submodularity) If the function $h(\cdot)$ is submodular in terms of vulnerable friends, the function $\sigma(\cdot)$ is submodular, i.e., $\sigma(A \cup \{v\}) - \sigma(A) \ge \sigma(B \cup \{v\}) - \sigma(B)$, where $A \subseteq B \subseteq D_{u,0}$ and $v \in D_{u,0}$.

Proof. If the function $h(\cdot)$ is submodular in terms of vulnerable friends, the marginal gain in vulnerability of user u, by adding a vulnerable friend v to an initial vulnerable set A, is at least as high as the marginal gain by adding the same vulnerable node v to an initial vulnerable superset B, i.e., $h(S \cup A \cup \{v\}) - h(S \cup A) \ge h(S \cup B \cup \{v\}) - h(S \cup B)$, where S is the set of non-vulnerable friends (which includes user u), A and B are sets of vulnerable friends, and $A \subseteq B$. This means that the new vulnerable friend v causes at least as much increase when added to a set A as when added to a superset B. Thus, $\sigma(\cdot)$ is submodular. □

Let us examine the assumption that the function $h(\cdot)$ is submodular in terms of vulnerable friends. Assume that user v is user u's vulnerable friend. If user u has 25 vulnerable friends as opposed to 50, where the 25 vulnerable friends are a subset of the 50, then based on the assumption about the function $h(\cdot)$, user v is more vulnerable for user u when u has fewer vulnerable friends than when u has more. In other words, the vulnerability of user v can be mitigated due to the presence of more of u's vulnerable friends.

Based on Theorems 1 and 2, the VMC problem is tantamount to the maximization of a non-negative, non-decreasing, submodular function with a cardinality constraint. A hill climbing algorithm can solve this problem with a provable constant approximation [57]. First, start with an empty output set $D_u$; add the one element from the initial set of vulnerable friends $D_{u,0}$ to the output set that provides the largest marginal increase in the function value; repeat the previous step until all the elements from the initial set of vulnerable friends $D_{u,0}$ are processed or the maximum cardinality bound k is reached. According to [57], this greedy algorithm gives a $(1 - 1/e)$-approximation for maximization of the $\sigma(\cdot)$ function with a given cardinality constraint.
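The hill climbing procedure can be sketched as below. The toy objective `coverage`, in which each vulnerable friend exposes a set of the user's attributes and $\sigma(\cdot)$ counts the distinct attributes exposed, is a hypothetical stand-in chosen because coverage functions are non-negative, non-decreasing, and submodular; it is not the objective used in the experiments. The names `greedy_max` and `EXPOSES` are likewise illustrative.

```python
from typing import Callable, FrozenSet, Iterable, List


def greedy_max(candidates: Iterable[str],
               sigma: Callable[[FrozenSet[str]], float],
               k: int) -> List[str]:
    """Greedy maximization of sigma under the cardinality bound k."""
    chosen: List[str] = []
    remaining = sorted(candidates)     # fixed order gives deterministic ties
    while remaining and len(chosen) < k:
        base = frozenset(chosen)
        base_value = sigma(base)
        # Element with the largest marginal increase in the function value.
        best = max(remaining, key=lambda v: sigma(base | {v}) - base_value)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical data: attributes of user u that each vulnerable friend exposes.
EXPOSES = {
    "A": {"phone", "email", "address"},
    "B": {"email", "address"},
    "C": {"birthday"},
    "D": {"phone"},
}


def coverage(subset: FrozenSet[str]) -> float:
    """Number of distinct attributes exposed by the friends in subset."""
    return float(len(set().union(*(EXPOSES[f] for f in subset))))


picked = greedy_max(EXPOSES, coverage, k=2)
```

In this toy instance the second pick is C rather than B, because B exposes nothing that A has not already exposed; this diminishing marginal gain is exactly what submodularity captures.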

Similarly, with Theorems 1 and 2, the VMUC problem is equivalent to the maximization of a non-negative, non-decreasing, submodular function with knapsack-like constraints. The greedy algorithm presented in [70] can be applied to solve the VMUC problem with a submodular objective function constrained by cardinality and social utility. That algorithm also gives a $(1 - 1/e)$-approximation guarantee.

The VMUC problem remains NP-hard [24] even when the objective function $\sigma(\cdot)$ is linear, constrained by social utility and cardinality. It can be reduced to a single dimensional knapsack problem. The following scaling and rounding algorithm is a fully polynomial time approximation scheme [75] (FPTAS) for the VMUC problem with a linear objective function and knapsack-like constraints: for each vulnerable user $i \in D_{u,0}$, define a new P-index $P'_i = \lfloor P_i / K \rfloor$, where $K = \epsilon P / n$, $n = |D_{u,0}|$, $P = \sum_{i \in D_{u,0}} P_i$, and $\epsilon > 0$ is a given error parameter; with the new P-index, using a dynamic programming algorithm similar to that for the single dimensional knapsack problem, find the most vulnerable set $D_u \subseteq D_{u,0}$ with $|D_u| \le k$.

To apply the scaling and rounding algorithm, the P-index values of all vulnerable friends need to be integers. The P-index value of each user $i \in D_{u,0}$ is a non-negative real number in the closed range 0 to 1. Since the proposed problem is a discrete optimization problem, I can simply convert these P-index values into integers by shifting the decimal points equally to the right. For the experiments on the VMUC problem with a linear objective, I multiply each P-index value by 1000 and then take the floor of the resulting value as the new integer value for the P-index. Errors caused by the scaling and rounding are negligible.
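The dynamic programming step can be sketched as follows for the linear objective, assuming an additive social utility loss as in Eq (2.18) and the fixed scaling by 1000 described above. This is an illustrative knapsack-style routine, not the implementation used in the experiments; the name `vmuc_linear`, the budget argument `c_u`, and the small tolerance added before taking the floor (to guard against floating-point error) are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def vmuc_linear(p: Sequence[float], s: Sequence[float], k: int, c_u: float,
                scale: int = 1000) -> Tuple[List[int], int]:
    """Return (indices to unfriend, total scaled P-index) with at most k
    friends and total social utility loss at most c_u."""
    # Integer P-indexes; the 1e-9 tolerance avoids floor(699.999...) = 699.
    profit = [math.floor(x * scale + 1e-9) for x in p]
    total = sum(profit)
    inf = float("inf")
    # best[j][q] = (minimum utility loss, chosen indices) over subsets of
    # exactly j friends whose scaled P-indexes sum to q.
    best = [[(inf, ()) for _ in range(total + 1)] for _ in range(k + 1)]
    best[0][0] = (0.0, ())
    for i in range(len(p)):
        # j runs downward so that friend i is used at most once.
        for j in range(min(i + 1, k), 0, -1):
            for q in range(total, profit[i] - 1, -1):
                prev_loss, prev_set = best[j - 1][q - profit[i]]
                loss = prev_loss + s[i]
                if loss < best[j][q][0]:
                    best[j][q] = (loss, prev_set + (i,))
    # Largest scaled vulnerability whose cheapest subset fits the budget.
    for q in range(total, -1, -1):
        for j in range(k + 1):
            loss, chosen = best[j][q]
            if loss <= c_u:
                return list(chosen), q
    return [], 0


# Table 2.1 values: P-indexes of (A, B, C) and tie strength as utility loss.
selection, score = vmuc_linear([0.9, 0.7, 0.5], [0.8, 0.5, 0.3], k=2, c_u=0.9)
```

With a utility budget of 0.9, unfriending A together with any other friend is too costly in tie strength, so the sketch selects B and C (indices 1 and 2) instead of the single most vulnerable friend A.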

Solutions presented for the VMC and VMUC problems, with corresponding assumptions, are summarized in Table 2.3. The VMUC problem for non-linear social utility gain constraints remains an open problem.


Objective function $\sigma(\cdot)$   VMC Problem   VMUC Problem
Linear                               1             FPTAS
Submodular                           (1 - 1/e)     (1 - 1/e)

Table 2.3: Best Known Approximation Schemes and Bounds for the VMC and VMUC Problems. The Objective Function $\sigma(\cdot)$ is Submodular Provided that the Function $h(\cdot)$ is also Submodular in Terms of Vulnerable Friends.

2.4  Experiments 

The proposed methods are demonstrated in practice through experiments using a dataset derived from a real social networking site. The proposed experiments address the challenge of vulnerability reduction with and without social utility constraints. With an approach for identifying vulnerable friends, I set out to investigate the following issues:

•  How effective are the measures in reducing vulnerability of users? What is an effective way of reducing one's vulnerability? How does it compare with random unfriending in reducing vulnerability of users?

•  Do the indexes address the dynamics of social networks? I study the impact of a new friend request and its effect on the vulnerability of a user.

•  How effective are the unfriending algorithms in recommending at most k (> 0) vulnerable friends to minimize user vulnerability while maintaining an acceptable level² of social utility loss?

•  Does the user vulnerability reduction change significantly for different social utility measures?

•  At most, how many vulnerable friends should a user unfriend to achieve a desired vulnerability reduction while maintaining an acceptable level of social utility loss?

Next I discuss the dataset used for experiments, use the proposed index estimation methods in an empirical study in an attempt to address these issues, report preliminary results, and suggest new lines of research in finding vulnerable users.

2.4.1  Facebook  Dataset 

According to Quantcast³, over 145 million unique users in the United States visit Facebook within a month. This puts Facebook among the top 3 websites based on the number of people in the United States who visit each site within a month. Facebook users spend over 700 billion minutes per month on the site. The statistics suggest that Facebook⁴ users provide a rich set of personal information through their profiles and social activities. Thus, I use a Facebook dataset for evaluating the proposed work.

The Facebook dataset⁵ is created by crawling Facebook user profiles. Crawling is performed in a breadth-first search manner starting from randomly selected users (roots). The dataset contains publicly available profiles as well as network information for more than two million users. I design two major tasks: vulnerability reduction without social utility constraints, and vulnerability reduction with social utility constraints. For the first task, I do not filter any Facebook user profiles from the dataset. However, for the second task I remove from the dataset the Facebook users who do not share their friends' information, since unfriending cannot be evaluated for such users. Table 2.4 shows the statistics of 300K randomly selected users from the dataset used for the second task.

Figure 2.4: Facebook Dataset Facts. (a) Actual vs Crawled Friends. (b) Actual vs Vulnerable Friends. (c) Crawled vs Vulnerable Friends.

² In this work, I set the acceptable level of social utility loss to be less than or equal to 10%.
³ http://www.quantcast.com/facebook.com, a media sharing and web analytics service company.
⁴ http://newsroom.fb.com/content/default.aspx?NewsAreaId=22
⁵ I use the same dataset as in [32].
Facebook dataset                     Count
Avg. actual friends per user         1056
Avg. crawled friends per user        154
Avg. vulnerable friends per user     98
Max. actual friends per user         5000
Min. actual friends per user         11
Max. crawled friends per user        4971
Min. crawled friends per user        11
Max. vulnerable friends per user     3222
Min. vulnerable friends per user     0

Table 2.4: Statistics for Randomly Selected 300K Facebook Users


For a given user, I may not obtain the information of all friends due to their privacy settings. Friends for whom I obtain the information are referred to as crawled friends. Figure 2.4 shows further facts about the randomly selected 300K users. The X-axis and Y-axis indicate users and their number of friends, respectively. For simplicity, before plotting Figures 2.4a and 2.4b, I sort all users in ascending order based on the number of Facebook friends, while for Figure 2.4c, I sort all users in ascending order based on the number of crawled Facebook friends. Figure 2.4a shows the relationship between actual Facebook friends (red) and crawled friends (blue) in the dataset. In the experiments, I estimate the V-index of each user as an average of all the P-index values of crawled friends. Based on the V-index, I compute the number of vulnerable friends for each user, as described in Section 2.1. Figure 2.4b shows the relationship between actual Facebook friends (red) and their vulnerable friends (blue). Figure 2.4c shows the relationship between crawled Facebook friends (red) and their vulnerable friends (blue).

2.4.2  Vulnerability  Reduction  without  Social  Utility  Constraints 

I divide this major task into three experiments to test the (1) impact of different unfriending strategies on users' vulnerability, (2) performance of unfriending the most vulnerable friend with different unfriending strategies, and (3) impact of new friendship on secure users from vulnerable users.

Impact  of  different  unfriending  strategies 

For the first set of experiments, I compare the V-index for each user under two optimal algorithms and six intuitive strategies for unfriending to reduce vulnerability. For each graph in Figure 2.5, the X-axis and Y-axis indicate users and their V-index values, respectively. For simplicity, I sort all users in ascending order based on their existing V-index, and then I plot their corresponding V-index before and after unfriending. Figure 2.5 indicates the performance of all eight algorithms, which helps decide whether unfriending makes users more or less vulnerable. The eight algorithms are:

•  Most vulnerable friend. For a user, the most vulnerable friend is the one whose removal lowers the V-index score the most. For each user, I first find the most vulnerable friend and then estimate the new V-index value (M1-index) after unfriending him/her. As expected, see Figure 2.5a, V-index values for users decrease in comparison with V-index values before unfriending the most vulnerable friend. Unfriending the most vulnerable friend makes all users more secure.

Figure 2.5: Performance Comparisons of V-Indexes for Each User Before (Red) and After (Blue) Unfriending based on Eight Different Algorithms.
•  Two most vulnerable friends. If I sort all of a user's vulnerable friends in ascending order based on their new V-indexes (after unfriending), the top two in the list are the two most vulnerable friends. For each user, I first find the two most vulnerable friends and then estimate the new V-index value (M2-index) after unfriending them. As expected, see Figure 2.5b, V-index values for all users decrease in comparison with V-index values before unfriending the two most vulnerable friends. Unfriending the two most vulnerable friends also makes all users more secure.

•  Least V-friend. For each user, I choose to unfriend the friend whose V-index is the lowest among all friends. This friend is the least V-friend. V-index values increase for 65% of the 100K users, and for 43% of the 2M+ users, in comparison with V-index values before unfriending the least V-friend. See Figure 2.5c; L-index refers to the new V-index value after unfriending the least V-friend. $V'_u > V_u$ for some users because $P_l < V_u$, where $P_l$ is the P-index of the least V-friend. Unfriending the least V-friend does not make all users insecure.

•  Random friend. For each user, I randomly choose a friend to unfriend. V-index values increase for 24% of the 100K users, and for 23.5% of the 2M+ users, in comparison with V-index values before unfriending a random friend. See Figure 2.5d; R-index refers to the new V-index value after unfriending a random friend. $V'_u > V_u$ for some users because $P_r < V_u$, where $P_r$ is the P-index of the random friend. Unfriending a random friend does not make all users secure.

•  Max P-friend. For each user, I choose to unfriend the friend whose P-index is the highest among all friends. V-index values increase for 5% of the 100K users, and for 11% of the 2M+ users, in comparison with V-index values before unfriending the max P-friend. See Figure 2.5e; MP1-index refers to the new V-index value after unfriending the max P-friend. $V'_u > V_u$ for some users because $P_{mp1} < V_u$, where $P_{mp1}$ is the P-index of the max P-friend. Unfriending the max P-friend makes a majority of users more secure.

•  Two max P-friends. For each user, I choose to unfriend the two friends whose P-indexes are the highest and second highest among all friends. V-index values increase for 5% of the 100K users, and for 11% of the 2M+ users, in comparison with V-index values before unfriending the two max P-friends. See Figure 2.5f; MP2-index refers to the new V-index value after unfriending the two max P-friends. $V'_u > V_u$ for some users because $(P_{mp1} + P_{mp2})/2 < V_u$, where $P_{mp1}$ and $P_{mp2}$ are the P-indexes of the two max P-friends. Unfriending the two max P-friends makes a majority of users more secure.

•  Max V-friend. For each user, I choose to unfriend the friend whose V-index is the highest among all friends. V-index values increase for 3.6% of the 100K users, and for 5% of the 2M+ users, in comparison with V-index values before unfriending the max V-friend. See Figure 2.5g; MV1-index refers to the new V-index value after unfriending the max V-friend. $V'_u > V_u$ for some users because $P_{mv1} < V_u$, where $P_{mv1}$ is the P-index of the max V-friend. Unfriending the max V-friend makes a majority of users more secure.

•  Two max V-friend. For each user, I unfriend the two friends whose V-indexes
are the highest and second highest among all friends. V-index values increase for
2.5% of the 100K users and for 5% of the 2M+ users, in comparison with V-index
values before unfriending the two max V-friends. See Figure 2.5h; MV2-index
refers to the new V-index value after unfriending the two max V-friends.
Vf > Vu for some users because (Pmv1 + Pmv2)/2 < Vu, where Pmv1 and Pmv2
are the P-indexes of the two max V-friends. Unfriending the two max V-friends
makes a majority of users more secure.
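
The conditions quoted in the bullets above are a property of averages: dropping
entries whose mean lies below the current mean raises the mean. The following is a
minimal sketch of that effect, assuming (this is an assumption for illustration, not
a definition taken from the text) that the V-index is a plain mean over a pool of
P-index values, here a hypothetical user's own P-index together with those of the
friends. The function name and the numbers are invented for the example.

```python
def mean_after_removal(values, removed):
    """Mean of `values` after dropping the entries at the indices in `removed`."""
    removed = set(removed)
    kept = [v for i, v in enumerate(values) if i not in removed]
    return sum(kept) / len(kept)


# Hypothetical pool: index 0 is the user's own P-index, the rest are friends.
pool = [0.9, 0.5, 0.4, 0.3]
v_before = sum(pool) / len(pool)          # about 0.525
v_after = mean_after_removal(pool, [1])   # drop the max P-friend (P-index 0.5)

# The removed P-index (0.5) is below the old mean, so the mean goes up.
print(v_before, v_after)
```

In this toy pool the max P-friend still has a P-index below the overall mean, so
unfriending that friend increases the average rather than decreasing it.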

Performance  comparison  with  the  best  unfriending  strategy 

In the second set of experiments, I compare the performance of unfriending the most
vulnerable friend with the seven intuitive unfriending strategies. For each graph in
Figure 2.6, the X-axis and Y-axis indicate users and their associated V-index values
after unfriending, respectively. I sort all users in ascending order based on V-index
values after unfriending the most vulnerable friend and then plot the corresponding
V-index values obtained with the different unfriending strategies. I find that
unfriending the most vulnerable friend makes users more secure.

As expected (see Figures 2.6a-2.6d), V-index values for each user based on unfriending
the least V-friend, a random friend, the max P-friend, or the max V-friend increase
for all users in comparison with their V-index values after unfriending the most
vulnerable friend. In the case of unfriending the least V-friend, V-index values
increase for 3% of users in comparison with unfriending the most vulnerable friend.
Similarly, V-index values increase for 1.7% of users when unfriending a random
friend, for 1.7% of users when unfriending the max P-friend, and for 1% of users when
unfriending the max V-friend. Thus, unfriending the most vulnerable friend makes all
users more secure than all other schemes.

V-index values for each user based on unfriending the two most vulnerable friends
(see Figure 2.6e) do not decrease for 10% of the 100K users and 21% of the 2M+ users,
in comparison with V-index values after unfriending the most vulnerable friend.
V-index values for each user based on unfriending the two max P-friends (see
Figure 2.6f) do not decrease for 51% of the 100K users and 81% of the 2M+ users, in
comparison with V-index values after unfriending the most vulnerable friend. V-index
values for each user based on unfriending the two max V-friends (see Figure 2.6g) do
not decrease for 90% of the 100K users and 75% of the 2M+ users, in comparison with
V-index values after unfriending the most vulnerable friend.

Impact  of  new  friends 

I now investigate the impact of new friendships with vulnerable users on two types of
secure users. I select three sets of 10K users from the 2M+ Facebook users: (S1)
users with high V-indexes, (S2) users with low V-indexes, and (S3) C-attributes
enabled users with low V-indexes. I randomly select a vulnerable user (i.e., from S1,
the 10K high V-index users) and a secure user (i.e., from S2, the 10K low V-index
users), pair them, and remove the pair from S1 and S2, respectively, until all 10K
users from S1 and S2 are paired. I repeat the same with sets S1 and S3. The two sets
of results are shown in Figure 2.7 (a) and (b). For each graph, the X- and Y-axis
indicate users and their V-index values before and after the pairing of new friends,
respectively. I sort all users in ascending order based on their old V-indexes. As
shown in Figure 2.7a, V-indexes of all users of S2 increase significantly and
consistently; in Figure 2.7b, V-indexes of users of S3 also increase, but the changes
vary from minor to large. The larger changes in the latter case occur for those users
of S3 with fewer friends. The results in Figure 2.7 confirm that less vulnerable users
can become more vulnerable if they are careless when making new friends, and that
reclusive users are more sensitive to the choice of new friends than less reclusive
ones.
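
The pairing procedure described above can be sketched as follows. This is only an
illustration under stated assumptions: the function name `pair_users`, the use of
integer user identifiers, and the seed parameter are hypothetical and not taken from
the original experiments. Shuffling both sets and zipping them is equivalent to
repeatedly drawing one user from each set at random without replacement.

```python
import random


def pair_users(vulnerable, secure, seed=None):
    """Randomly pair each vulnerable user with exactly one secure user."""
    vulnerable, secure = list(vulnerable), list(secure)
    if len(vulnerable) != len(secure):
        raise ValueError("both sets must have the same size")
    rng = random.Random(seed)  # local generator, so the pairing is reproducible
    rng.shuffle(vulnerable)
    rng.shuffle(secure)
    return list(zip(vulnerable, secure))


# Hypothetical identifiers standing in for users drawn from S1 and S2.
pairs = pair_users(range(5), range(100, 105), seed=0)
```

Each returned pair would then be treated as a new friendship, and the V-index of the
secure user recomputed before and after adding the new friend.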

Figure 2.6: Performance Comparisons of Unfriending the Most Vulnerable Friend
(Red) with Seven Different Unfriending Ways (Blue).

Figure 2.7: Impact of New Friendship (Blue) on Users with Low V-Indexes (Red)
from Users with High V-Indexes


2.4.3  Vulnerability  Reduction  with  Social  Utility  Constraints 

To evaluate the theoretical findings on vulnerability reduction with social utility
constraints, I again design three experiments. First, in Figure 2.8, I compare the
V-index value of each user before and after unfriending the k most vulnerable
friends. I refer to this as the baseline. I do not consider the social utility loss
constraint when obtaining the baseline. Second, in Figure 2.9, I compare the V-index
value of each user before and after unfriending at most k vulnerable friends while
maintaining an acceptable level of social utility loss. I refer to this as the social
utility loss based approach. Third, in Figure 2.10, I compare the reductions achieved
by the baseline and the social utility loss approach for each user. I observe from
these experiments that it is possible to suggest the maximum number of vulnerable
friends to unfriend in order to achieve a desired reduction in user vulnerability
while maintaining an acceptable level of social utility loss.

For each graph in Figures 2.8, 2.9, and 2.10, the X-axis and Y-axis indicate users
and their V-index values, respectively. Without loss of generality, I sort all users
in ascending order of their existing V-index values before plotting the graphs in
Figures 2.8 and 2.9. For Figure 2.10, I sort all users in ascending order of the
V-index values computed using the baseline approach.

Figure 2.8: (The Baseline) Performance Comparison of V-Index Values for Each User
Before (Red) and After (Blue) Unfriending the k Most Vulnerable Friends from His
Social Network.

The  baseline 

It consists of results from solving the VMC problem with a linear objective function
$\sigma(\cdot)$. As shown in Section 2.1, such a problem can be solved in polynomial
time: the k (or $|D_{u,o}|$, if $|D_{u,o}| < k$) most vulnerable friends are selected
for removal to minimize the objective function. I run this experiment on 300K
randomly selected users of the Facebook dataset. Figure 2.8 shows the performance
comparison of V-index values for each user before (red) and after (blue) unfriending
at most k vulnerable friends.

Figure 2.9: Performance Comparisons of V-Index Values for Each User Before (Red)
and After (Blue) Unfriending at Most k Vulnerable Friends with the Total Degree of
Vulnerable Users as a Social Utility Constraint. I Set an Error Parameter ε = 0.1
(Input Parameter to FPTAS) and Retain At Least 90% of the Total Degrees of All
the Vulnerable Friends After Unfriending. Panels: (a) Unfriend 1, (b) Unfriend 2,
(c) Unfriend 10, (d) Unfriend 50.

I run the experiments for different values of k including 1, 2, 10, and 50. As
expected, vulnerability decreases consistently as the value of k increases, as seen
in Figures 2.8a-2.8d. For a given k, the baseline is expected to achieve the maximum
user vulnerability reduction, but cannot guarantee the retention of socially valuable
vulnerable friends.
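
The baseline selection can be sketched in a few lines. This is a minimal
illustration, assuming that each friend carries a single vulnerability score (a
hypothetical stand-in; under the averaging estimate used in these experiments it
would be the friend's P-index) and that the user's V-index is the mean of those
scores. The function name `baseline_unfriend` is invented for the example.

```python
def baseline_unfriend(scores, k):
    """Remove the k highest-scoring friends (all of them if there are fewer than k).

    scores -- hypothetical per-friend vulnerability scores (e.g. P-index values)
    Returns (removed indices, mean score before, mean score after).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    removed = order[:k]  # slicing already caps the count at len(scores)
    dropped = set(removed)
    kept = [s for i, s in enumerate(scores) if i not in dropped]
    before = sum(scores) / len(scores) if scores else 0.0
    after = sum(kept) / len(kept) if kept else 0.0
    return removed, before, after


removed, before, after = baseline_unfriend([0.9, 0.8, 0.3, 0.2, 0.1], k=2)
```

Sorting dominates the cost, so the selection runs in O(n log n) time per user, which
is consistent with the polynomial-time claim for the linear objective.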
A social utility based approach

I compute the minimized user vulnerability by solving the VMUC problem. For the
experiments, the V-index for each user is estimated as the average of the P-index
values of all crawled friends. Hence, the objective function $\sigma(\cdot)$ for VMUC
is linear. As discussed in Section 2.3, the scaling and rounding algorithm presented
there is an FPTAS for such a relaxed VMUC problem. Figure 2.9 shows the performance
comparisons of minimized V-index values for each user before (red) and after (blue)
unfriending at most k vulnerable friends with the sum of all their degrees as a
social utility constraint. Given the theoretical guarantees of the FPTAS, I set the
error parameter ε to a relatively low value and aim to retain as many valuable
friends as possible. I set ε = 0.1 (the input parameter to the FPTAS) and retain at
least 90% of the total degrees of all the vulnerable friends after unfriending. As
the FPTAS runs slower for a smaller error parameter, I run the experiments for 10K
users randomly selected from the 300K users. As expected, vulnerability drops
consistently as the value of k increases. I show the experiment results for four
different k values (1, 2, 10, and 50) in Figures 2.9a-2.9d. I also run the
experiments with other forms of social utility measures such as tie strength and
number of common friends. The tie strength between two friends is drawn at random as
a value between 0 and 1, where 1 represents the maximum social tie strength. I
observe similar patterns in user vulnerability reduction for these two social utility
measures.
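
To make the constraint concrete, the sketch below solves a tiny instance by
exhaustive search. It is not the scaling and rounding FPTAS referred to above; it
only illustrates what "unfriend at most k friends while retaining at least 90% of the
total utility" means. The per-friend scores and utilities, the function name, and the
choice of the mean remaining score as the objective are all assumptions made for the
example.

```python
from itertools import combinations


def constrained_unfriending(scores, utilities, k, retain=0.9):
    """Exhaustively pick at most k friends to remove (tiny inputs only).

    scores[i]    -- hypothetical vulnerability score of friend i
    utilities[i] -- hypothetical social utility of friend i (e.g. degree)
    retain       -- fraction of the total utility that must be kept
    """
    n = len(scores)
    total_utility = sum(utilities)
    best_removed, best_value = (), sum(scores) / n  # start from "remove nobody"
    for size in range(1, min(k, n - 1) + 1):
        for removed in combinations(range(n), size):
            kept_utility = total_utility - sum(utilities[i] for i in removed)
            if kept_utility < retain * total_utility:
                continue  # violates the social utility constraint
            kept = [scores[i] for i in range(n) if i not in removed]
            value = sum(kept) / len(kept)
            if value < best_value:
                best_removed, best_value = removed, value
    return best_removed, best_value


scores = [0.9, 0.8, 0.3, 0.2, 0.1]
utilities = [50, 5, 20, 15, 10]
removed, value = constrained_unfriending(scores, utilities, k=2)
```

In this toy instance the most vulnerable friend also carries half of the total
utility, so the constraint forces the search to remove the second most vulnerable
friend instead. The exhaustive search is exponential in k and is shown only to make
the feasible set explicit.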

Comparing  the  social  utility  approach  with  the  baseline 

The purpose of this experiment is to evaluate how effective the social utility
approach is in reducing vulnerability and retaining social utility. Figure 2.10 shows
the performance comparison. The results for four different k values (1, 2, 10, and
50) are depicted in Figures 2.10a-2.10d. I use the total number of degrees of
vulnerable friends as a social utility measure. As expected, the vulnerability
reduction for the baseline is larger than that for the social utility approach.
However, a significant reduction is still achieved with the social utility
constraint. I observe similar results for the other two social utility measures,
i.e., tie strength and number of common friends.

Figure 2.10: Performance Comparison Between the Baseline and the Social Utility
Approach. Panels: (a) Unfriend 1, (b) Unfriend 2, (c) Unfriend 10, (d) Unfriend 50.

Before unfriending any vulnerable friend, the average V-index value of all users is
0.3485. Table 2.5 shows the average V-index values for the baseline and the social
utility approach. The results for the three different social utility measures are
presented. It also reports the percentage decrease in average V-index value for each
approach. The marginal reduction in the average V-index decreases as the value of k
increases.

                   Unfriending k Friends
Approach           k = 1              k = 2              k = 5              k = 10              k = 50
Baseline           0.3425 (↓ 1.71%)   0.3371 (↓ 3.26%)   0.3216 (↓ 7.74%)   0.2944 (↓ 15.52%)   0.1891 (↓ 45.73%)
Degree             0.3427 (↓ 1.67%)   0.3381 (↓ 2.98%)   0.3261 (↓ 6.45%)   0.3048 (↓ 12.54%)   0.2185 (↓ 37.31%)
Priority           0.3426 (↓ 1.68%)   0.3378 (↓ 3.07%)   0.3250 (↓ 6.77%)   0.3025 (↓ 13.20%)   0.2146 (↓ 38.43%)
Common Friends     0.3426 (↓ 1.68%)   0.3378 (↓ 3.08%)   0.3249 (↓ 6.8%)    0.3027 (↓ 13.13%)   0.2151 (↓ 38.28%)

Table 2.5: The Average V-index Values for the Baseline and Three Different Social
Utility Measures. Numbers in Brackets are Percentage Decreases in Each Average
V-index Value. The Average V-index Value of All Users Before Unfriending Any
Vulnerable Friend is 0.3485.

Results show that the baseline is always the most effective in removing the
vulnerable friends. This is because it removes the most vulnerable friends without
considering the social utility loss for a given k. But this may come at the cost of a
loss in social utility value for a user. The social utility approach aims to retain
90% of a user's social utility value for a given k while minimizing the user's
vulnerability. Table 2.5 provides a summary of comparative results for
k = 1, 2, 5, 10, 50. I observe the following: (1) the more vulnerable friends are
removed, the less vulnerable a user is in all four cases (the baseline plus 3 social
utility measures); (2) by allowing for a 10% loss of a social utility measure, one
can still achieve a reduction comparable to the baseline; (3) the vulnerability
reductions under all three social utility measures are similar; (4) for a given k,
the baseline achieves the largest reduction; and (5) the baseline's gain over a
social utility measure can be easily eliminated by moving to the next larger k. For
example, unfriending 2 vulnerable friends with any social utility measure attains a
vulnerability reduction that is larger than that of the baseline with k = 1.
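
The bracketed percentages in Table 2.5 and observation (5) can be rechecked from the
reported averages. The sketch below recomputes the percentage decrease relative to
the starting average of 0.3485; the variable and function names are hypothetical.
Because the table lists averages rounded to four decimals, the recomputed percentages
agree with the bracketed values only to within a few hundredths of a percentage
point.

```python
BEFORE = 0.3485  # average V-index before unfriending, as reported in the text
K_VALUES = [1, 2, 5, 10, 50]

# Average V-index values copied from Table 2.5, one list entry per k in K_VALUES.
averages = {
    "Baseline":       [0.3425, 0.3371, 0.3216, 0.2944, 0.1891],
    "Degree":         [0.3427, 0.3381, 0.3261, 0.3048, 0.2185],
    "Priority":       [0.3426, 0.3378, 0.3250, 0.3025, 0.2146],
    "Common Friends": [0.3426, 0.3378, 0.3249, 0.3027, 0.2151],
}


def percent_decrease(before, after):
    """Relative decrease of `after` with respect to `before`, in percent."""
    return 100.0 * (before - after) / before


decreases = {
    name: [percent_decrease(BEFORE, v) for v in values]
    for name, values in averages.items()
}

# Observation (5): every utility-constrained run with k = 2 ends below the
# baseline average obtained with k = 1.
k2_beats_baseline_k1 = all(
    values[1] < averages["Baseline"][0]
    for name, values in averages.items()
    if name != "Baseline"
)

for name, values in decreases.items():
    print(name, [round(p, 2) for p in values])
```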

2.5  Summary 

There are vulnerable friends on social networking sites, and it is important to find
and unfriend them so that users can improve their privacy and security. However,
unfriending vulnerable friends from a user’s social network can significantly
decrease the user’s social utility. In this chapter, I studied the novel problem of
vulnerability reduction with and without social utility loss constraints. First, I
provided a general model for vulnerability reduction. Using this model, I formulated
two discrete optimization problems, viz., VMC and VMUC. The VMC problem only
considers the cardinality constraint, while the VMUC problem considers cardinality as
well as social utility constraints. Both problems are NP-hard. Experiments on the
Facebook dataset evaluate the effectiveness of different methods of vulnerability
reduction with and without social utility constraints.

I proposed a feasible approach to a novel problem of identifying a user’s vulnerable
friends on a social networking site. This work differs from existing work addressing
social networking privacy by introducing a vulnerability-centered approach to user
security on a social networking site. On most social networking sites, privacy
related efforts have been concentrated on protecting individual attributes only.
However, users are often vulnerable through community attributes. Unfriending
vulnerable friends can help protect users against the security risks. Based on this
study of over 2 million users, I find that users are either not careful about or not
aware of the security and privacy concerns of their friends. The proposed model
clearly highlights the impact of each new friend on a user’s privacy. The proposed
approach does not require a structural change of a social networking site and aims
to maximally reduce a user’s vulnerability while minimizing his social utility loss.
The work formulates a novel problem of constrained vulnerability reduction, suggests
a feasible approach, and demonstrates that the problem of constrained vulnerability
reduction is solvable.

Chapter 3

IDENTIFYING VULNERABILITY OF ACTIVE USERS WITH PRIVATE SETTINGS


In this chapter, my focus is on active users who mark the majority of their profile
settings private (i.e., Q2 users). For the rest of this chapter, they are simply
referred to as users, unless otherwise stated. Social media sites want their users to
be more social and, at the same time, less concerned about unwarranted access to
their personal data. Recent social media advancements are creating new opportunities
for meaningful interactions among users, while enabling new profile settings for
users to better protect their personal information. New mechanisms such as Facebook
pages allow users to interact through posts without requiring them to be friends,
while keeping their personal information, including demographic profiles, lists of
friends, and interactions with friends, private. Users’ interactions on these pages
are often centrally administered and publicly available to everyone. Based on whether
a user can control the visibility of her actions, a post can be categorized as either
a personal or a public post. A personal post is a post whose visibility can be
controlled by a user’s individual profile settings; otherwise it is referred to as a
public post. In this chapter, I exclusively focus on public posts. Users’ actions on
public posts, including liking, commenting, and sharing, are together referred to as
their public interactions, while the interactions on personal posts are personal
interactions. Given the pervasive availability of public interactions, I ask: are
users’ personal attributes predictable using only the interactions on public posts?

To answer the question, I study the problem of the predictability of users’ personal
attributes in the context of Facebook pages. There are several challenges regarding
the data in Facebook pages. The first challenge is the unavailability of users’
connections. Based on the literature, users’ connections play a vital role in
predicting personal attributes. However, users’ connections can be marked invisible
using profile settings. Also, Facebook pages do not require users to form any kind of
connection to interact with each other. The second challenge is text complexity.
During recent events across the globe, including the “Arab Spring”, “Assam riots”,
and “Bangladesh Protests”, Facebook users primarily communicated in their local
languages such as Arabic, Assamese, and Bengali rather than English, which makes the
text data complex to analyze. Finally, although public interactions are pervasively
available, public posts are visible to everyone and all users can perform actions on
them. This property of public posts makes the public interaction data extremely
sparse and further exacerbates the difficulty of the prediction problem.

3.1  Problem  Formulation 

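I first present the notations used in this chapter. Let $\mathbf{A} \in
\mathbb{R}^{n \times m}$ be a matrix, where n is the number of rows and m is the
number of columns. The entry at the i-th row and j-th column of $\mathbf{A}$ is
denoted as $\mathbf{A}(i,j)$. $\mathbf{A}(i,:)$ and $\mathbf{A}(:,j)$ denote the i-th
row and j-th column of $\mathbf{A}$, respectively. $\|\mathbf{A}\|_F$ is the
Frobenius norm of $\mathbf{A}$, and

$$\|\mathbf{A}\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} \mathbf{A}(i,j)^2}.$$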

Typically, two types of objects are involved in public interactions, i.e., users and
public posts in Facebook pages. Let $u = \{u_1, u_2, \ldots, u_n\}$ be the set of
users, and $v = \{v_1, v_2, \ldots, v_m\}$ be the set of public posts, where n and m
are the total numbers of users and public posts, respectively. Depending on the
social media site, users’ interactions involve different types of actions. For
example, Facebook users mainly perform three types of actions on public posts and
their associated items: liking, commenting, and sharing. For each publicly known
action, I can construct the user-post action matrix $\mathbf{R} \in
\mathbb{R}^{n \times m}$, where $\mathbf{R}(i,j) = 1$ if the i-th user performs the
action on the j-th post, and 0 otherwise. For simplicity of discussion, I assume that
$\mathbf{R}$ contains user-post like actions.
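
As a small illustration of the notation, the sketch below builds a user-post like
matrix from a list of (user index, post index) pairs and evaluates its Frobenius
norm. The event list and the helper names are hypothetical, and plain Python lists
are used only to keep the example free of dependencies.

```python
import math


def build_action_matrix(n_users, n_posts, actions):
    """Return R as a list of rows, with R[i][j] = 1 if user i acted on post j."""
    R = [[0] * n_posts for _ in range(n_users)]
    for i, j in actions:
        R[i][j] = 1
    return R


def frobenius_norm(A):
    """Square root of the sum of squared entries of A."""
    return math.sqrt(sum(x * x for row in A for x in row))


# Hypothetical like events: (user index, post index), both zero-based.
likes = [(0, 1), (2, 0), (2, 3)]
R = build_action_matrix(3, 4, likes)
norm_R = frobenius_norm(R)
```

For a binary matrix such as R, the squared Frobenius norm equals the number of
observed actions, which makes the sparsity of the data directly visible.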

Users can control their personal data such as demographic profiles and personal
posts, including status updates, photos, and videos, via their profile settings to
avoid unwarranted access. However, users’ interactions in Facebook pages are beyond
their control and are always available to the public. Therefore, in this chapter, I
ask: are users’ personal attributes predictable using public interactions in Facebook
pages? To answer this question, I design a task of predicting users’ personal
attributes with public interaction data.

The problem of predicting users’ attributes is extensively studied. It assumes that N
of the users in u are labeled, with N < n. I assume that $u_L = \{u_1, u_2, \ldots,
u_N\}$ is the set of labeled users, where $u_L$ is a subset of u. Let $\mathbf{Y}_L
\in \mathbb{R}^{N \times K}$ be the label matrix of $u_L$, where K is the total
number of values of a given attribute. The vast majority of existing attribute
prediction algorithms make use of users’ personal data such as their personal posts
[62, 13] or their social networks [39, 53, 74] to obtain a predictor f to predict the
attribute of users in $\{u \setminus u_L\}$. To seek an answer to the question of
whether users’ personal attributes are predictable using only public interaction
data, I investigate the predictability of users’ attributes from their interactions
with public posts, which is formally stated as follows: given users’ public
interactions on posts and the known attribute labels $\mathbf{Y}_L$, I aim to learn a
predictor f to automatically predict the attribute for the unlabeled users, i.e.,
$\{u \setminus u_L\}$.

3.2  Framework  for  Attribute  Prediction:  SCOUT 

A user usually performs like actions on a small proportion of personal posts,
resulting in sparse user-post action relationships. One of the key differences
between public and personal posts is that only friends can perform interactions on
personal posts, whereas all users can perform interactions on public posts. Hence,
interaction patterns on public posts are likely to be sparser than those on personal
posts. Thus, the problem of predicting personal attributes from such sparse public
interactions is more challenging for traditional classification methods including
support vector machines (SVM), logistic regression, and naive Bayes. The proposed
framework, SCOUT, aims to address the sparse interactions problem by learning a
compact representation of users with the help of social theories. This compact
representation is later used to build a predictor f to automatically predict the
personal attributes.

3.2.1  Learning  a  Compact  Representation 

The low-rank matrix factorization-based method is one of the popular ways to obtain a
compact representation of users [71]. In this chapter, I adopt the well-known matrix
factorization model [18] to obtain a low-rank representation of users. The matrix
factorization model seeks a low-rank representation $\mathbf{U} \in
\mathbb{R}^{n \times d}$ with $d \ll n$ by solving the following optimization
problem:

$$\min_{\mathbf{U},\mathbf{H},\mathbf{V}} \|\mathbf{R} - \mathbf{U}\mathbf{H}\mathbf{V}^T\|_F^2, \tag{3.1}$$

where $\mathbf{V} \in \mathbb{R}^{m \times d}$ is a low-rank space representation of
the set of public posts, and $\mathbf{H} \in \mathbb{R}^{d \times d}$ captures the
correlations between the low-rank representations of users and public posts such that
$\mathbf{R}(i,j) = \mathbf{U}(i,:)\mathbf{H}\mathbf{V}(j,:)^T$. To avoid
over-fitting, I add smoothness regularization on $\mathbf{U}$, $\mathbf{H}$, and
$\mathbf{V}$ to Eq. (3.1), which gives

$$\min_{\mathbf{U},\mathbf{H},\mathbf{V}} \|\mathbf{R} - \mathbf{U}\mathbf{H}\mathbf{V}^T\|_F^2 + \lambda\left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2 + \|\mathbf{H}\|_F^2\right), \tag{3.2}$$

where $\lambda$ is non-negative and is introduced to control the capacity of
$\mathbf{U}$, $\mathbf{V}$, and $\mathbf{H}$. The learnt compact representation may
be inaccurate because of the sparsity of $\mathbf{R}$. The number of zero entries in
$\mathbf{R}$ is much larger than the number of non-zero entries, which means that
$\mathbf{U}(i,:)\mathbf{H}\mathbf{V}(j,:)^T$ will mostly be fit to zero. The extreme
sparsity of $\mathbf{R}$ will result in a learnt representation $\mathbf{U}$ close to
a zero matrix.

One way to mitigate the data sparsity challenge is to give different weights to the
observed and missing actions. In detail, I introduce a weight matrix $\mathbf{W} \in
\mathbb{R}^{n \times m}$, where $\mathbf{W}(i,j)$ is the weight indicating the
importance of $\mathbf{R}(i,j)$ in the factorization process. The new formulation is
presented in Eq. (3.3) as

$$\min_{\mathbf{U},\mathbf{H},\mathbf{V}} \|\mathbf{W} \odot (\mathbf{R} - \mathbf{U}\mathbf{H}\mathbf{V}^T)\|_F^2 + \lambda\left(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2 + \|\mathbf{H}\|_F^2\right), \tag{3.3}$$

where $\odot$ is the Hadamard product and $(\mathbf{A} \odot \mathbf{B})(i,j) =
\mathbf{A}(i,j) \times \mathbf{B}(i,j)$ for any two matrices $\mathbf{A}$ and
$\mathbf{B}$ of the same size. $\mathbf{W}(i,j) = 1$ if $\mathbf{R}(i,j) = 1$.
Following the suggestions in [72], I set $\mathbf{W}(i,j)$ to a small value close to
zero when $\mathbf{R}(i,j) = 0$, which allows negative samples in the learning
process. In this work, I set $\mathbf{W}(i,j) = 0.01$ when $\mathbf{R}(i,j) = 0$.
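
The weighted objective in Eq. (3.3) can be evaluated directly for small matrices. The
sketch below is only an illustration of the formula, not an optimizer: the function
name, the list-of-lists representation, and the regularization weight `lam = 0.1`
used in the example are assumptions, while the weight 0.01 for unobserved entries
follows the text.

```python
def weighted_objective(R, U, H, V, lam, w_zero=0.01):
    """Value of Eq. (3.3) for list-of-lists R (n x m), U (n x d), H (d x d), V (m x d)."""
    d = len(H)

    def sq_fro(M):
        # Squared Frobenius norm of a list-of-lists matrix.
        return sum(x * x for row in M for x in row)

    fit = 0.0
    for i, r_row in enumerate(R):
        # Row vector U(i,:) H, reused for every post j.
        uh = [sum(U[i][a] * H[a][b] for a in range(d)) for b in range(d)]
        for j, r in enumerate(r_row):
            pred = sum(uh[b] * V[j][b] for b in range(d))
            w = 1.0 if r == 1 else w_zero  # W(i,j) as described in the text
            fit += (w * (r - pred)) ** 2
    return fit + lam * (sq_fro(U) + sq_fro(V) + sq_fro(H))


# Tiny hypothetical example with n = m = 2 and d = 1.
R = [[1, 0], [0, 1]]
U = [[1.0], [0.0]]
H = [[1.0]]
V = [[1.0], [1.0]]
loss = weighted_objective(R, U, H, V, lam=0.1)
```

In this example the wrong prediction on the unobserved entry (user 0, post 1)
contributes only 0.0001 to the loss, while the missed observed entry (user 1, post 1)
contributes 1, which is the down-weighting of zeros that the formulation is meant to
achieve.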

In addition to like actions, users can perform other actions such as sharing and
commenting. There are many social theories, such as the homophily [52] and
consistency [1] theories, developed to explain users’ actions. These social theories
pave a way for us to model user-user and post-post correlations, which can
potentially further mitigate the data sparsity problem.

3.2.2  Modeling  Correlations 

User-user and post-post correlations in social media are widely used to improve
various tasks such as feature selection [73], sentiment analysis [38, 45] and
recommendation [49]. Next, I propose a novel way to compute the user-user and
post-post correlations to tackle the sparsity problem further using users’ actions on
public posts and their associated items such as comments and shared posts.

Modeling User-User Correlations


Apart from likes, users also perform other actions including commenting, replying
and sharing on different types of objects such as posts, shared posts, and comments.
This subsection provides a way to include these users’ activities by modeling
user-user correlations. Homophily [52] is one of the important social theories
developed to explain users’ actions during interactions in the real world. Homophily
theory suggests that similar users are likely to perform similar actions. These
intuitions motivate us to obtain a low-rank space representation of users based on
their historical actions during interactions. I define $T(i,j)$ to measure the
user-user correlation coefficient between $u_i$ and $u_j$. There are many ways to
measure user-user correlation, such as similarity of users’ behavior [50] and
connections in social networks [49]. In this chapter, I choose the similarity of
users’ historical behavior to measure user-user correlations. A user can perform a
variety of actions, including liking, commenting, and sharing. Hence, similarity is
calculated as a function of the total amount of actions performed by two users
together:

$$T(i,j) = h(l(i,j), c(i,j), s(i,j)), \tag{3.4}$$

where $l(i,j)$, $c(i,j)$, and $s(i,j)$ record the number of likes, comments, and
shares, respectively, performed by $u_i$ and $u_j$ together. $h(\cdot)$ combines
these users’ behaviors, and is defined as a sign function in this chapter:

$$T(i,j) = \begin{cases} 1 & \text{if } l(i,j) + c(i,j) + s(i,j) > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.5}$$
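
A minimal sketch of Eqs. (3.4) and (3.5) follows. It assumes, as an interpretation
that the text does not spell out, that two users perform an action "together" when
both perform the same type of action on the same post, so that the co-action counts
can be read off binary user-post matrices for likes, comments, and shares. All names
and the toy data are hypothetical.

```python
def co_action_counts(A):
    """counts[i][j] = number of posts on which users i and j both performed the action."""
    n = len(A)
    return [
        [sum(a * b for a, b in zip(A[i], A[j])) for j in range(n)]
        for i in range(n)
    ]


def user_correlation(likes, comments, shares):
    """Sign-function correlation: 1 if any co-like, co-comment, or co-share exists."""
    l = co_action_counts(likes)
    c = co_action_counts(comments)
    s = co_action_counts(shares)
    n = len(likes)
    return [
        [1 if l[i][j] + c[i][j] + s[i][j] > 0 else 0 for j in range(n)]
        for i in range(n)
    ]


# Three users, two posts; rows are users, columns are posts.
likes = [[1, 0], [1, 0], [0, 0]]     # users 0 and 1 both like post 0
comments = [[0, 0], [0, 0], [0, 1]]  # only user 2 comments on post 1
shares = [[0, 0], [0, 1], [0, 0]]    # only user 1 shares post 1
T = user_correlation(likes, comments, shares)
```

Under this reading the diagonal entry T(i,i) is 1 for any user with at least one
action; diagonal entries do not affect the pairwise distance term, because the
distance of a user to herself is zero.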

With $T(i,j)$, I model user-user correlations by minimizing the following term:

$$\min \sum_{i=1}^{n} \sum_{j=1}^{n} T(i,j) \|\mathbf{U}(i,:) - \mathbf{U}(j,:)\|_2^2. \tag{3.6}$$

Users close to each other in the low-rank space are more likely to be similar, and
their distances in the latent space are controlled by their correlation coefficients.
For example, $T(i,j)$ controls the latent distance between $u_i$ and $u_j$. A larger
value of $T(i,j)$ indicates that $u_i$ and $u_j$ are more likely to be similar, so
their latent representations are forced to be as close as possible, while a smaller
value of $T(i,j)$ indicates that the distance between their latent representations
should be larger.

For a particular user $u_i$, the terms in Eq. (3.6) related to her latent
representation $\mathbf{U}(i,:)$ are

$$\min \sum_{j=1}^{n} T(i,j) \|\mathbf{U}(i,:) - \mathbf{U}(j,:)\|_2^2. \tag{3.7}$$

The latent representation of $u_i$ is smoothed with those of other users, controlled
by $T(i,j)$. Hence, even for long-tail users with few or no actions, I can still get
an approximate estimate of their latent representation via user-user correlations,
addressing the sparsity problem in Eq. (3.3).
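
The pairwise term in Eq. (3.6) can be evaluated directly, and for a symmetric
correlation matrix half of it equals a trace expression with a graph Laplacian
(the degree matrix minus the correlation matrix), which is a standard identity. The
sketch below checks that identity numerically on a toy example; the function names
and the data are hypothetical, and symmetry of T is an assumption of the check.

```python
def pairwise_smoothness(T, U):
    """Sum over i, j of T(i,j) * ||U(i,:) - U(j,:)||^2."""
    n, d = len(U), len(U[0])
    return sum(
        T[i][j] * sum((U[i][k] - U[j][k]) ** 2 for k in range(d))
        for i in range(n)
        for j in range(n)
    )


def laplacian_trace(T, U):
    """Tr(U^T (D - T) U), where D is diagonal with D(i,i) = sum_j T(j,i)."""
    n, d = len(U), len(U[0])
    degree = [sum(T[r][i] for r in range(n)) for i in range(n)]
    lap = [
        [(degree[i] if i == j else 0) - T[i][j] for j in range(n)]
        for i in range(n)
    ]
    return sum(
        U[i][k] * lap[i][j] * U[j][k]
        for k in range(d)
        for i in range(n)
        for j in range(n)
    )


# Symmetric toy correlations: user 1 is correlated with users 0 and 2.
T = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
U = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
half_pairwise = 0.5 * pairwise_smoothness(T, U)
trace_form = laplacian_trace(T, U)
```

Both quantities evaluate to 6 for this example. The trace form is convenient because
it turns a double sum over user pairs into matrix products that fit into the
factorization objective.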

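After some derivations, I can get the matrix form of Eq. (3.6) as

$$\begin{aligned}
\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} T(i,j) \|\mathbf{U}(i,:) - \mathbf{U}(j,:)\|_2^2
&= \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{d} T(i,j) \left(\mathbf{U}(i,k) - \mathbf{U}(j,k)\right)^2 \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{d} T(i,j)\, \mathbf{U}(i,k)^2 - \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{d} T(i,j)\, \mathbf{U}(i,k)\, \mathbf{U}(j,k) \\
&= \sum_{k=1}^{d} \mathbf{U}(:,k)^T (\mathbf{D}^u - \mathbf{S})\, \mathbf{U}(:,k) \\
&= Tr(\mathbf{U}^T \mathcal{L}^u \mathbf{U}),
\end{aligned} \tag{3.8}$$

where $\mathcal{L}^u = \mathbf{D}^u - \mathbf{S}$ is the Laplacian matrix and
$\mathbf{D}^u$ is a diagonal matrix with the i-th diagonal element
$\mathbf{D}^u(i,i) = \sum_{j=1}^{n} T(j,i)$. $\mathbf{S}$ is the user-user
correlation matrix defined as

$$\mathbf{S} = \begin{bmatrix}
T(1,1) & T(1,2) & \cdots & T(1,n) \\
T(2,1) & T(2,2) & \cdots & T(2,n) \\
\vdots & \vdots & \ddots & \vdots \\
T(n,1) & T(n,2) & \cdots & T(n,n)
\end{bmatrix}.$$

Modeling Post-Post Correlations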

Apart from likes, posts also receive other actions from users, including comments and
shares. This subsection provides a way to include these activities on posts.
Consistency [1] is one of the important social theories developed to explain users’
actions; it suggests that users’ actions on similar posts are likely to remain
consistent. These intuitions motivate us to obtain a low-rank space representation of
posts based on the historical actions received by them. I define $\Phi(i,j)$ to
measure the post-post correlation between $v_i$ and $v_j$. In this chapter, I choose
the similarity of actions received by posts to measure post-post correlations. A post
can receive a variety of actions, including liking, commenting, and sharing. Hence,
similarity is calculated as a function of the total amount of actions received by two
posts together:

$$\Phi(i,j) = g(l(i,j), c(i,j), s(i,j)), \tag{3.9}$$

where $l(i,j)$, $c(i,j)$, and $s(i,j)$ record the number of users who perform likes,
comments, and shares, respectively, on $v_i$ and $v_j$ together. $g(\cdot)$ combines
these users’ behaviors, and is defined as a sign function in this chapter:

$$\Phi(i,j) = \begin{cases} 1 & \text{if } l(i,j) + c(i,j) + s(i,j) > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3.10}$$
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With  <f>(i,  j),  I  model  post-post  correlations  by  minimizing  the  following  term  as 


min  £5>(i,j)||v(v)-V(j,:)K  (3.11) 

i=  1  j= 1 
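As an illustration of Eqs. (3.9) and (3.10), the sketch below builds the post-post correlation matrix from per-post sets of user ids. It is a minimal sketch and not the implementation used in this chapter: the function name `post_post_correlation` and the set-based input format are assumptions made for the example, and each co-action count is taken to be the size of a set intersection.

```python
def post_post_correlation(likers, commenters, sharers):
    """Return the 0/1 matrix Phi of Eqs. (3.9)-(3.10) as a list of lists.

    Each argument is a list with one set of user ids per post
    (hypothetical input format chosen for this example).
    """
    m = len(likers)
    phi = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            l_ij = len(likers[i] & likers[j])          # users who like both posts
            c_ij = len(commenters[i] & commenters[j])  # users who comment on both
            s_ij = len(sharers[i] & sharers[j])        # users who share both
            phi[i][j] = 1 if l_ij + c_ij + s_ij > 0 else 0
    return phi


# Toy example with three posts.
likers = [{1, 2}, {2, 3}, {4}]
commenters = [set(), set(), set()]
sharers = [{9}, set(), {9}]
phi = post_post_correlation(likers, commenters, sharers)
```

In the toy example, posts 0 and 1 share a liker and posts 0 and 2 share a sharer, while posts 1 and 2 have no user in common, so only that pair gets a zero coefficient.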

Posts close to each other in the low-rank space are more likely to be similar, and their distances in the latent space are controlled by their correlation coefficients. For example, $\Phi(i,j)$ controls the latent distance between $v_i$ and $v_j$. A larger value of $\Phi(i,j)$ indicates that $v_i$ and $v_j$ are more likely to be similar, so I force their latent representations to be as close as possible, while a smaller value of $\Phi(i,j)$ allows the distance between their latent representations to be larger.

For a particular post $v_i$, the terms in Eq. (3.11) related to its latent representation $\mathbf{V}_i$ are

$$
\min \sum_{j=1}^{m}\Phi(i,j)\,\|\mathbf{V}(i,:)-\mathbf{V}(j,:)\|_2^2. \tag{3.12}
$$

I can see that the latent representation of $v_i$ is smoothed with those of other posts, controlled by $\Phi(i,j)$. Hence, even for long-tail posts with few or no actions, I can still obtain an approximate estimate of their latent representations via post-post correlations, which addresses the sparsity problem in Eq. (3.3).

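By following the derivations in Eq. (3.8), I can get the matrix form of Eq. (3.11) as

$$
\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\Phi(i,j)\,\|\mathbf{V}(i,:)-\mathbf{V}(j,:)\|_2^2 = \mathrm{Tr}(\mathbf{V}^{\top}\mathcal{L}^{v}\mathbf{V}), \tag{3.13}
$$

where $\mathcal{L}^{v} = \mathbf{D}^{v} - \mathbf{P}$ is the Laplacian matrix and $\mathbf{D}^{v}$ is a diagonal matrix with the $i$-th diagonal element $\mathbf{D}^{v}(i,i) = \sum_{j=1}^{m}\Phi(j,i)$. $\mathbf{P}$ is the post-post correlation matrix defined as

$$
\mathbf{P} =
\begin{pmatrix}
\Phi(1,1) & \Phi(1,2) & \cdots & \Phi(1,m) \\
\Phi(2,1) & \Phi(2,2) & \cdots & \Phi(2,m) \\
\vdots & \vdots & \ddots & \vdots \\
\Phi(m,1) & \Phi(m,2) & \cdots & \Phi(m,m)
\end{pmatrix}.
$$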
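The identity behind Eqs. (3.8) and (3.13) can be checked numerically. The following sketch compares the pairwise form with the trace form on random toy data; it assumes a symmetric, nonnegative correlation matrix, and the helper names (`laplacian`, `pairwise_form`, `trace_form`) are introduced only for this example.

```python
import numpy as np


def laplacian(C):
    """Return D - C, where D(i, i) is the i-th column sum of C."""
    return np.diag(C.sum(axis=0)) - C


def pairwise_form(C, X):
    """0.5 * sum_ij C(i, j) * ||X(i, :) - X(j, :)||_2^2, computed naively."""
    n = C.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += C[i, j] * np.sum((X[i] - X[j]) ** 2)
    return 0.5 * total


def trace_form(C, X):
    """Tr(X^T (D - C) X), the matrix form used in Eqs. (3.8) and (3.13)."""
    return float(np.trace(X.T @ laplacian(C) @ X))


rng = np.random.default_rng(0)
A = rng.random((6, 6))
C = (A + A.T) / 2       # symmetric, nonnegative toy correlations
X = rng.random((6, 3))  # toy latent representations
gap = abs(pairwise_form(C, X) - trace_form(C, X))
```

The two forms agree up to floating-point rounding. Diagonal entries of the correlation matrix cancel in D - C, so they do not influence the regularizer.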

3.2.3  The  Proposed  Algorithm  to  Learn  Compact  Representation 


With the components for modeling user-user and post-post correlations, the proposed algorithm first solves the following optimization problem:

$$
\min_{\mathbf{U},\mathbf{H},\mathbf{V}}\ \|\mathbf{W}\odot(\mathbf{R}-\mathbf{U}\mathbf{H}\mathbf{V}^{\top})\|_F^2
+ \lambda\big(\|\mathbf{U}\|_F^2 + \|\mathbf{V}\|_F^2 + \|\mathbf{H}\|_F^2\big)
+ \alpha\,\mathrm{Tr}(\mathbf{U}^{\top}\mathcal{L}^{u}\mathbf{U})
+ \beta\,\mathrm{Tr}(\mathbf{V}^{\top}\mathcal{L}^{v}\mathbf{V}),
\tag{3.14}
$$

where the first term exploits the available like actions of users on posts, the term weighted by $\alpha$ captures user-user correlations, and the term weighted by $\beta$ captures post-post correlations. The parameters $\alpha$ and $\beta$ are introduced to control the contributions from user-user and post-post correlations, respectively.
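To make the roles of the terms in Eq. (3.14) concrete, the sketch below evaluates the objective for given factors. It is an illustrative NumPy sketch under the assumption that all inputs are dense arrays; the function names and argument order are choices made for this example rather than part of the proposed framework.

```python
import numpy as np


def laplacian(C):
    """Return D - C, where D(i, i) is the i-th column sum of C."""
    return np.diag(C.sum(axis=0)) - C


def objective(R, W, U, H, V, S, P, lam, alpha, beta):
    """Value of the objective in Eq. (3.14) for dense NumPy inputs."""
    residual = W * (R - U @ H @ V.T)               # weighted reconstruction error
    fit = np.sum(residual ** 2)                    # squared Frobenius norm
    reg = lam * (np.sum(U ** 2) + np.sum(V ** 2) + np.sum(H ** 2))
    user = alpha * np.trace(U.T @ laplacian(S) @ U)  # user-user smoothness
    post = beta * np.trace(V.T @ laplacian(P) @ V)   # post-post smoothness
    return float(fit + reg + user + post)


# Tiny hand-checkable case: U = H = V = I reproduces R = I exactly.
I2 = np.eye(2)
ones = np.ones((2, 2))
S = np.array([[0.0, 1.0], [1.0, 0.0]])
value = objective(I2, ones, I2, I2, I2, S, np.zeros((2, 2)), lam=1.0, alpha=1.0, beta=0.0)
```

In this toy case the fit term is 0, the regularizer contributes 2 + 2 + 2 = 6, and the user-user term equals the trace of the 2 x 2 Laplacian, which is 2.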

The optimization problem in Eq. (3.14) involves the three variables $\mathbf{U}$, $\mathbf{H}$, and $\mathbf{V}$ jointly. A local minimum of the objective function $\mathcal{J}$ in Eq. (3.14) can be obtained through an alternating scheme, which updates one variable at a time while keeping the other two fixed.

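Computation of H. Optimizing the objective function in Eq. (3.14) with respect to $\mathbf{H}$ is equivalent to solving

$$
\min_{\mathbf{H}}\ \mathcal{J}_H = \|\mathbf{W}\odot(\mathbf{R}-\mathbf{U}\mathbf{H}\mathbf{V}^{\top})\|_F^2 + \lambda\|\mathbf{H}\|_F^2. \tag{3.15}
$$

The derivative of $\mathcal{J}_H$ with respect to $\mathbf{H}$ is

$$
\frac{\partial \mathcal{J}_H}{\partial \mathbf{H}} = -2\,\mathbf{U}^{\top}(\mathbf{W}\odot\mathbf{W}\odot\mathbf{R})\mathbf{V} + 2\,\mathbf{U}^{\top}(\mathbf{W}\odot\mathbf{W}\odot\mathbf{U}\mathbf{H}\mathbf{V}^{\top})\mathbf{V} + 2\lambda\mathbf{H}. \tag{3.16}
$$

Hence, the minimum can be achieved by setting $\big[\frac{\partial \mathcal{J}_H}{\partial \mathbf{H}}\big](i,j) = 0$. Thus, I obtain

$$
\big[-\mathbf{U}^{\top}(\mathbf{W}\odot\mathbf{W}\odot\mathbf{R})\mathbf{V} + \mathbf{U}^{\top}(\mathbf{W}\odot\mathbf{W}\odot\mathbf{U}\mathbf{H}\mathbf{V}^{\top})\mathbf{V} + \lambda\mathbf{H}\big](i,j) = 0. \tag{3.17}
$$

Similar to [18], this leads to the updating rule of $\mathbf{H}$,

$$
\mathbf{H}(i,j) \leftarrow \mathbf{H}(i,j)\,\sqrt{\frac{\big[\mathbf{U}^{\top}(\mathbf{W}\odot\mathbf{W}\odot\mathbf{R})\mathbf{V}\big](i,j)}{\big[\mathbf{U}^{\top}(\mathbf{W}\odot\mathbf{W}\odot\mathbf{U}\mathbf{H}\mathbf{V}^{\top})\mathbf{V} + \lambda\mathbf{H}\big](i,j)}}. \tag{3.18}
$$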
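The gradient in Eq. (3.16), on which the updating rule is based, can be sanity-checked against central finite differences. The sketch below does this on random toy data; the function names, matrix sizes and step size are assumptions made for the example only.

```python
import numpy as np


def j_h(R, W, U, H, V, lam):
    """Objective of Eq. (3.15) as a function of H."""
    return np.sum((W * (R - U @ H @ V.T)) ** 2) + lam * np.sum(H ** 2)


def grad_h(R, W, U, H, V, lam):
    """Analytic gradient of Eq. (3.16)."""
    WW = W * W
    return (-2 * U.T @ (WW * R) @ V
            + 2 * U.T @ (WW * (U @ H @ V.T)) @ V
            + 2 * lam * H)


def max_gradient_error(seed=0, n=5, m=4, d=2, lam=0.1, step=1e-6):
    """Largest gap between analytic and finite-difference gradient entries."""
    rng = np.random.default_rng(seed)
    R = (rng.random((n, m)) < 0.3).astype(float)  # toy binary like matrix
    W = rng.random((n, m))                        # toy weights
    U, H, V = rng.random((n, d)), rng.random((d, d)), rng.random((m, d))
    G = grad_h(R, W, U, H, V, lam)
    worst = 0.0
    for a in range(d):
        for b in range(d):
            E = np.zeros((d, d))
            E[a, b] = step
            numeric = (j_h(R, W, U, H + E, V, lam)
                       - j_h(R, W, U, H - E, V, lam)) / (2 * step)
            worst = max(worst, abs(numeric - G[a, b]))
    return worst


error = max_gradient_error()
```

Because the objective is quadratic in H, the central difference matches the analytic gradient up to rounding error.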


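Computation of U. Similar to the computation of $\mathbf{H}$, optimizing the objective function in Eq. (3.14) with respect to $\mathbf{U}$ leads to the updating rule of $\mathbf{U}$,

$$
\mathbf{U}(i,j) \leftarrow \mathbf{U}(i,j)\,\sqrt{\frac{\big[(\mathbf{W}\odot\mathbf{W}\odot\mathbf{R})\mathbf{V}\mathbf{H}^{\top} + \alpha\mathbf{S}\mathbf{U}\big](i,j)}{\big[(\mathbf{W}\odot\mathbf{W}\odot\mathbf{U}\mathbf{H}\mathbf{V}^{\top})\mathbf{V}\mathbf{H}^{\top} + \alpha\mathbf{D}^{u}\mathbf{U} + \lambda\mathbf{U}\big](i,j)}}. \tag{3.19}
$$

Computation of V. Similarly, optimizing the objective function in Eq. (3.14) with respect to $\mathbf{V}$ leads to the updating rule of $\mathbf{V}$,

$$
\mathbf{V}(i,j) \leftarrow \mathbf{V}(i,j)\,\sqrt{\frac{\big[(\mathbf{W}\odot\mathbf{W}\odot\mathbf{R})^{\top}\mathbf{U}\mathbf{H} + \beta\mathbf{P}\mathbf{V}\big](i,j)}{\big[(\mathbf{W}\odot\mathbf{W}\odot\mathbf{U}\mathbf{H}\mathbf{V}^{\top})^{\top}\mathbf{U}\mathbf{H} + \beta\mathbf{D}^{v}\mathbf{V} + \lambda\mathbf{V}\big](i,j)}}. \tag{3.20}
$$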


It can be proven that the updating rules in Eq. (3.18), Eq. (3.19) and Eq. (3.20) are guaranteed to converge. Since the proof is similar to that in [66, 18] and follows the standard auxiliary function approach, I omit it to save space.

In summary, I present the computational algorithm for optimizing Eq. (3.14) in Algorithm 1. In lines 1 and 2, I construct the two Laplacian matrices and randomly initialize the three matrices to be inferred. T is the maximum number of iterations. The three matrices are updated with the updating rules until convergence or until the maximum number of iterations is reached.

Algorithm 1 The Proposed Algorithm to Learn Compact Representation

Input: $\mathbf{R}$, $\mathbf{S}$, $\mathbf{P}$, $d$, $\lambda$, $\alpha$, $\beta$, $T$

Output: $\mathbf{U}$, $\mathbf{V}$

1: Construct matrices $\mathcal{L}^{u}$ and $\mathcal{L}^{v}$ in Eq. (3.8) and Eq. (3.13)
2: Initialize matrices $\mathbf{U}$, $\mathbf{V}$ and $\mathbf{H}$ randomly
3: while not convergent and $t < T$ do
4:   Update $\mathbf{H}(i,j) \leftarrow \mathbf{H}(i,j)\sqrt{\frac{[\mathbf{U}^{\top}(\mathbf{W}\odot\mathbf{W}\odot\mathbf{R})\mathbf{V}](i,j)}{[\mathbf{U}^{\top}(\mathbf{W}\odot\mathbf{W}\odot\mathbf{U}\mathbf{H}\mathbf{V}^{\top})\mathbf{V}+\lambda\mathbf{H}](i,j)}}$
5:   Update $\mathbf{U}(i,j) \leftarrow \mathbf{U}(i,j)\sqrt{\frac{[(\mathbf{W}\odot\mathbf{W}\odot\mathbf{R})\mathbf{V}\mathbf{H}^{\top}+\alpha\mathbf{S}\mathbf{U}](i,j)}{[(\mathbf{W}\odot\mathbf{W}\odot\mathbf{U}\mathbf{H}\mathbf{V}^{\top})\mathbf{V}\mathbf{H}^{\top}+\alpha\mathbf{D}^{u}\mathbf{U}+\lambda\mathbf{U}](i,j)}}$
6:   Update $\mathbf{V}(i,j) \leftarrow \mathbf{V}(i,j)\sqrt{\frac{[(\mathbf{W}\odot\mathbf{W}\odot\mathbf{R})^{\top}\mathbf{U}\mathbf{H}+\beta\mathbf{P}\mathbf{V}](i,j)}{[(\mathbf{W}\odot\mathbf{W}\odot\mathbf{U}\mathbf{H}\mathbf{V}^{\top})^{\top}\mathbf{U}\mathbf{H}+\beta\mathbf{D}^{v}\mathbf{V}+\lambda\mathbf{V}](i,j)}}$
7:   $t = t + 1$
8: end while

Algorithm 2 The Proposed Framework SCOUT

Input: $\mathbf{R}$, $\mathbf{S}$, $\mathbf{P}$, $\lambda$, $\alpha$, $\beta$, $T$, and $\mathbf{Y}_L$

Output: An SVM classifier $f$

1: Learn the compact representation $\mathbf{U}$ for $u$ by Algorithm 1;
2: Obtain the compact representation $\mathbf{U}_L$ for the labeled users $u_L$;
3: Train an SVM classifier $f$ with $\mathbf{U}_L$ and $\mathbf{Y}_L$

3.2.4 Building A Classifier for Attribute Prediction

After obtaining the low-rank representation $\mathbf{U}$ by Algorithm 1, I choose the well-known linear SVM as the basic classifier for the attribute prediction task. The detailed framework SCOUT is presented in Algorithm 2.

Next I briefly review the above algorithm. In line 1, I learn the compact representation by Algorithm 1, and in line 3, I train an SVM classifier based on the representation of the labeled users $\mathbf{U}_L$ and their labels $\mathbf{Y}_L$. Note that the proposed framework uses SVM as the basic classifier, which only takes discrete values as labels, such as gender, sexual orientation, relationship status, and religious affiliation. For a continuous-valued attribute such as age, I simply discretize it for SVM. Regression models could instead be chosen as basic models to deal with continuous-valued attributes; I leave this as future work.
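The representation-learning step that line 1 of Algorithm 2 relies on (Algorithm 1) can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the code used for the experiments: the stopping test (largest entry-wise change below `tol`), the small constant `eps` in the denominators, the default arguments and the function name are all introduced here for illustration, and the inputs are assumed to be dense nonnegative arrays.

```python
import numpy as np


def learn_representation(R, W, S, P, d, lam, alpha, beta, T=200, tol=1e-6, seed=0):
    """Alternate the multiplicative updates of Eqs. (3.18)-(3.20).

    R, W: n x m nonnegative arrays; S: n x n and P: m x m nonnegative arrays.
    Returns the factors U (n x d), H (d x d) and V (m x d).
    """
    eps = 1e-12  # assumption: guards against division by zero
    rng = np.random.default_rng(seed)
    n, m = R.shape
    Du = np.diag(S.sum(axis=0))  # degree matrix of the user-user correlations
    Dv = np.diag(P.sum(axis=0))  # degree matrix of the post-post correlations
    U, V, H = rng.random((n, d)), rng.random((m, d)), rng.random((d, d))
    WWR = W * W * R
    for _ in range(T):
        U_old, H_old, V_old = U.copy(), H.copy(), V.copy()
        WWX = W * W * (U @ H @ V.T)
        H = H * np.sqrt((U.T @ WWR @ V) / (U.T @ WWX @ V + lam * H + eps))
        WWX = W * W * (U @ H @ V.T)
        U = U * np.sqrt((WWR @ V @ H.T + alpha * (S @ U))
                        / (WWX @ V @ H.T + alpha * (Du @ U) + lam * U + eps))
        WWX = W * W * (U @ H @ V.T)
        V = V * np.sqrt((WWR.T @ U @ H + beta * (P @ V))
                        / (WWX.T @ U @ H + beta * (Dv @ V) + lam * V + eps))
        # Assumed convergence test: stop when no entry moves by more than tol.
        change = max(np.abs(U - U_old).max(),
                     np.abs(H - H_old).max(),
                     np.abs(V - V_old).max())
        if change < tol:
            break
    return U, H, V


# Toy usage with a small random binary like matrix.
rng = np.random.default_rng(1)
R = (rng.random((8, 6)) < 0.4).astype(float)
W = np.ones_like(R)
S = (R @ R.T > 0).astype(float)  # toy user-user correlations
P = (R.T @ R > 0).astype(float)  # toy post-post correlations
U, H, V = learn_representation(R, W, S, P, d=3, lam=0.1, alpha=0.1, beta=0.01, T=50)
```

The rows of `U` that belong to labeled users would then be passed, together with their labels, to a linear SVM as in line 3 of Algorithm 2. Because every update multiplies by a nonnegative ratio, the factors stay nonnegative when the inputs and the initialization are nonnegative.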

3.3  Experiments 

In this section, I conduct experiments to answer the following two questions: (1) can the proposed framework predict users' attributes from public interaction data? and (2) is it necessary to learn a compact representation? After introducing the experimental settings, I design experiments to answer these two questions and finally analyze the important parameters of the proposed framework.

3.3.1  Facebook  Datasets 

For experiments, I collect a Facebook dataset consisting of users' interactions on the “Basher Kella” Page¹ during the recent events of the Bangladesh protests². The “Basher Kella” Facebook page represents an influential political organization in Bangladesh which also has a record of supporting violence³. This Facebook page⁴ was founded on March 7, 2013. From March 7, 2013 to April 21, 2013, I collect Facebook users' actions on all the posts published on this page. The majority of these posts, comments and replies are written in the Bengali language.

1 https://www.facebook.com/newbasherkella

2 http://en.wikipedia.org/wiki/2013_Shahbag_protests

3 http://www.thedailystar.net/beta2/news/net-instigation-in-full-force/

4 The old version of the Basher Kella Facebook page was banned during the recent events of the Bangladesh protests due to its violent content. On March 7, a new page was created, which can be accessed at https://www.facebook.com/newbasherkella

Table 3.1: Statistics of the Facebook Dataset

Statistic | Value
# of days crawled | 47
# of users | 498,674
# of public posts | 9,907
Avg # of Likes per user | 15.87
Avg # of Likes per post | 580.10
Avg # of Comments per user | 1.20
Avg # of Comments per post | 44.11
Avg # of Shares per user | 2.79
Avg # of Shares per post | 139.89

Originally I collect 42,599 public posts, of which the administrators of this page contribute 23.25% (9,907) (admin-posts). As I expected, users' actions on the admin-posts were significantly more numerous than those on the posts from other Facebook users. The majority of the posts from other users receive no actions. Therefore, I only use public posts from the administrators of the page in this study. For each post, I collect all the users who like, comment on and share it. For each comment on a post, I collect all the users who like and reply to it. For each reply, I collect all the users who like it. For each share, I collect all the users who like and comment on it. Finally, for each comment on a share, I also collect the users who like and reply to it, and for each reply to a comment on a share, I collect all the users who like it. Table 3.1 shows the overall statistics of the dataset.

In this work, I choose three attributes, i.e., religious affiliation, relationship status and sexual orientation. For the religious affiliation attribute, to establish the ground truth for evaluation, I first use the Facebook graph search results to examine the set of users who set the attribute available to the public and then collect the attributes of these users with their public interaction data to establish a dataset, Facebook-religion, to assess the performance of the proposed framework. The statistics of Facebook-religion are shown in Table 3.1a. I find only a small portion of users (2853 out of 42,599) who set their religious affiliation publicly available, which is consistent with the observation in [32]. I obtain five values for religious affiliation, i.e., Muslim, Atheist, Buddhist, Hindu, and Christian, and these are indeed the top five religions in Bangladesh based on their populations⁵. To assess the performance of the proposed framework with other attributes, I also choose another two attributes, relationship status and sexual orientation. For each of these two attributes, following a similar process, I build a dataset based on 2853 users. The statistics of Facebook-relation and Facebook-sexual are shown in 3.1b and 3.1c, respectively. For relationship status, I consider two values, single and not-single. All the Facebook users with relationship status values of “married”, “engaged” and “in a relationship” are considered not-single, whereas “single”, “widowed”, and “divorced” are considered single. Sexual orientation attributes are interpreted using the “Interested In” values from Facebook users' profiles. I consider three values for sexual orientation: users who like men, users who like women, and users who like both men and women.

For each dataset, I choose x% of the dataset for training and the remaining (100 − x)% for testing. In this work, I vary x over {50, 60, 70, 80, 90}. For each x, I repeat the experiments 5 times and report the average performance.
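The evaluation protocol above can be sketched as follows. The sketch assumes a plain random split with a different seed per repetition, which is one reasonable reading of the protocol rather than a confirmed detail; `split_indices`, `average_over_repeats` and the `evaluate` callback are hypothetical names introduced for the example.

```python
import random


def split_indices(n_items, x, seed):
    """Randomly split range(n_items) into x% training and (100 - x)% testing."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    cut = round(n_items * x / 100)
    return indices[:cut], indices[cut:]


def average_over_repeats(n_items, x, evaluate, repeats=5):
    """Average evaluate(train, test) over `repeats` random splits."""
    scores = [evaluate(*split_indices(n_items, x, seed)) for seed in range(repeats)]
    return sum(scores) / len(scores)


# Example: an 80/20 split of the 2853 users in Facebook-religion.
train, test = split_indices(2853, 80, seed=0)
```

Here `evaluate` stands for any function that trains on the first index list and returns a score on the second, for instance the macro-average F1 score.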

5 http://en.wikipedia.org/wiki/Bangladesh

Religion | # of Users | Percentage (%)
Muslim | 1866 | 65.40
Atheist | 216 | 7.57
Buddhist | 113 | 3.96
Hindu | 463 | 16.23
Christian | 195 | 6.84
Total | 2853 | 100

(a) Statistics of the Facebook-Religion Dataset

Values | # of Users | Percentage (%)
Single | 760 | 65.01
Not-single | 409 | 34.99
Total | 1169 | 100

(b) Statistics of the Facebook-Relation Dataset

Values | # of Users | Percentage (%)
Men | 196 | 9.65
Women | 507 | 24.96
Men & Women | 1328 | 65.39
Total | 2031 | 100

(c) Statistics of the Facebook-Interested-In Dataset

From the evaluation perspective, precision and recall are equally important for the prediction task. For example, in an Islamic country like Bangladesh, the cost of incorrectly predicting someone as atheist could be disastrous, as it carries connotations of blasphemy⁶. However, precision, recall, and F1-score are each biased towards one of the labels. Hence, they are unsuitable for an unbalanced evaluation dataset. For this purpose, I use the commonly adopted macro-average F1 score to assess the prediction performance, as it gives equal weight to all the labels. The macro-average F1 score is defined as

$$
\text{macro-}F1 = \frac{\sum_{i=1}^{k} F_i}{k}, \quad \text{where } F_i = \frac{2\,p_i\,r_i}{p_i + r_i}, \tag{3.21}
$$

where $p_i$ and $r_i$ refer to the precision and recall values associated with the $i$-th label, respectively, and $k$ is the number of labels. Note that the macro-F1 score cannot be computed for a baseline which always picks the majority label for the prediction.
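A small sketch of Eq. (3.21) is given below. It returns `None` when some label is never predicted, because the precision of that label is then 0/0; this mirrors the majority-label case described above. The function name and the convention of scoring a label as 0 when both its precision and recall are 0 are assumptions made for this example.

```python
def macro_f1(y_true, y_pred):
    """Macro-average F1 as in Eq. (3.21); None stands for "not computable"."""
    labels = sorted(set(y_true))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        n_true = sum(t == label for t in y_true)
        if n_pred == 0:
            return None  # precision is 0/0 for this label
        precision, recall = tp / n_pred, tp / n_true
        if precision + recall == 0:
            scores.append(0.0)  # assumed convention when F_i would be 0/0
        else:
            scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)


score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
majority = macro_f1(["a", "a", "b"], ["a", "a", "a"])
```

In the first call, label "a" has precision 1 and recall 1/2 (F = 2/3) and label "b" has precision 2/3 and recall 1 (F = 4/5), so the macro average is 11/15. In the second call the classifier never predicts "b", so the score is not computable.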

3.3.2  Performance  Evaluation 

In this subsection, I conduct experiments to answer the first question: can the proposed framework predict users' personal attributes from users' public interaction data? To answer this question, I investigate the performance of the proposed framework by comparing it with random performance. For SCOUT, I use cross-validation to determine the parameter values; more details about the parameter analysis are discussed in the following subsections. I empirically set the number of latent dimensions d to 50. The performance results for Facebook-religion, Facebook-relation and Facebook-sexual are shown in Figures 3.1a, 3.1b and 3.1c, respectively.

I  have  the  following  observations: 

• For all of Figures 3.1a, 3.1b and 3.1c, the performance of SCOUT increases with the increase of x. This is because more training data helps to build a better SVM classifier.

• The proposed framework consistently outperforms the random method. The proposed algorithm gains up to 70.49% and 49.83% relative improvement in Facebook-religion and Facebook-sexual, respectively. I conduct a t-test on these results, and the evidence from the t-test suggests that the improvement is significant. These results support that users' personal attributes are predictable from public interaction data. In the following subsections, I will investigate the contributions from different components to this improvement.

In conclusion, the above results suggest a positive answer to the first question: the proposed framework can predict various personal attributes of users from public interaction data.

6 http://en.wikipedia.org/wiki/Atheism_and_religion#Atheism_in_Islam

3.3.3  Impact  of  the  Learnt  Compact  Representation 

As mentioned above, the public interaction data representation matrix R is very sparse, and I proposed an algorithm to learn a compact representation U with the help of social theories. In this subsection, I study the impact of the compact representation on the performance of the proposed framework to answer the second question. In detail, I define the following variants:

• SCOUT-Corr: Eliminate the impact of the user-user and post-post correlations by setting α = β = 0 in Eq. (3.14);

• SCOUT-Corr-W: Eliminate the effects of both the correlations and the weight matrix W by setting α = β = 0 and W to be a matrix with all entries equal to 1 in Eq. (3.14); and

• SCOUT-R: Eliminate the impact of the compact representation by learning the SVM classifier with the original matrix R.

Figure 3.1: Prediction Performance of the Proposed Framework

I only show results on Facebook-relation in Figure 3.2 since I have similar observations with other settings. Note that sometimes the classifier gives the majority prediction and I cannot compute macro-F1 in this situation; hence I use “N.A.” to denote the performance in the table. When the impact of the user-user and post-post correlations is eliminated, the performance of SCOUT-Corr drops, which indicates the importance of incorporating these correlations based on social theories. When these correlations and the weight matrix W are eliminated, a traditional matrix factorization algorithm learns the compact representation and the performance of SCOUT-Corr-W drops dramatically. As mentioned before, the extreme sparsity of R leads to a learned compact representation close to zero. When the classifier is built on the sparse matrix R, the performance of SCOUT-R also drops considerably. I cannot learn a good classifier from the sparse and high-dimensional matrix R, which directly supports the importance of learning a compact representation for users.

In conclusion, the learned compact representation can mitigate the sparsity problem of public interaction data and plays an important role in the performance improvement of the proposed framework, which answers the second question.

Figure 3.2: Impact of the Learnt Compact Representation on the Proposed Framework. Note that sometimes the classifier gives the majority prediction and I cannot compute macro-F1 in this situation; hence I use “N.A.” to denote the performance in the table.

3.3.4 Impact of User-User and Post-Post Correlations

The parameters α and β are introduced to control the contributions from user-user and post-post correlations in the proposed framework SCOUT. Therefore, I investigate the impact of user-user and post-post correlations by analyzing how changes in α and β affect the performance of SCOUT in terms of attribute prediction. I vary the values of α and β over {0, 0.01, 0.1, 1, 10, 100, 1000}. The results are shown in Figure 3.3 for Facebook-religion. I omit the results with other settings since I have similar observations.

Figure 3.3: Impact of User-User and Post-Post Correlations on Predicting Religious Affiliations, Respectively.

In general, with the increase of α and β, I observe similar patterns: the performance first increases, reaches its peak value and then degrades rapidly. These patterns can be used to determine the optimal values of α and β for SCOUT in practice. In particular, it can be observed that:

• When α is increased from zero (which eliminates the impact of user-user correlations on SCOUT) to 0.1, the performance improves, suggesting the importance of user-user correlations in the proposed framework to mitigate the data sparsity problem.

• When β is increased from zero (which eliminates the impact of post-post correlations on SCOUT) to 0.01, the performance improves, suggesting the importance of post-post correlations in the proposed framework.

• SCOUT achieves its best performance when α = 0.1 and β = 0.01, further demonstrating the importance of user-user and post-post correlations in SCOUT.

• From α = 0.1 to α = 1000, the performance decreases rapidly. In comparison, the performance decrease from β = 0.01 to β = 1000 is not rapid. When α and β are very large, user-user and post-post correlations dominate the learning process and the learned representation is inaccurate. For example, when α → +∞ and β → +∞, I will obtain a trivial solution in which all $\mathbf{U}_i$ (1 ≤ i ≤ n) are exactly the same.
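The trivial solution in the last observation follows from the form of the correlation regularizer. A short sketch of the argument, writing $\Psi(i,j)$ for the user-user correlation coefficient and assuming that $\Psi(i,j) \ge 0$ is symmetric and that the user-user correlation graph is connected (assumptions not stated explicitly in this section):

```latex
\mathrm{Tr}(\mathbf{U}^{\top}\mathcal{L}^{u}\mathbf{U})
  = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\Psi(i,j)\,
    \|\mathbf{U}(i,:)-\mathbf{U}(j,:)\|_2^2 \;\ge\; 0,
\qquad
\mathrm{Tr}(\mathbf{U}^{\top}\mathcal{L}^{u}\mathbf{U}) = 0
  \iff \mathbf{U}(i,:) = \mathbf{U}(j,:) \ \text{whenever}\ \Psi(i,j) > 0 .
```

As α grows, any nonzero value of this term is penalized without bound, so the minimizer is driven towards identical rows for all correlated users; on a connected graph this means a single shared latent vector, which carries no information for attribute prediction. The same argument applies to the post representations and β.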

In summary, appropriate incorporation of user-user and post-post correlations into the dimension reduction algorithm can greatly improve the prediction performance of personal attributes.

3.4 Summary

In order to secure users' privacy, researchers have been exploring different cues to show users' vulnerabilities against privacy attacks. Unlike these methods, my focus is on predicting personal attributes from publicly available interactions alone so that users can secure their privacy. To the best of my knowledge, I am the first to address this problem. I provide a way to mitigate the sparsity problem of public interaction data with the help of social theories and propose a novel framework, SCOUT, to predict users' personal attributes from public interaction data. The evaluation of the proposed framework with real-world data from a Facebook page shows that users' attributes are predictable.

Chapter 4

LITERATURE REVIEW

User privacy on a social networking site has received considerable attention recently. Gross and Acquisti [31] evaluate the amount of information disclosed through a social networking site and study the usage of privacy settings. This work revealed that only a few users change the default privacy preferences on Facebook. Narayanan and Shmatikov [55, 56] show that users are not well protected on a social networking site by successfully deanonymizing network data solely based on network topology. They also highlight the fact that privacy laws are inadequate, confusing, and inconsistent among nations, making social networking sites more vulnerable. Wondracek et al. [79] propose a simple deanonymization scheme which exploits group membership information to breach users' privacy. They also point out that social network data aggregation projects such as OpenID¹, DataPortability², the social graph project³, and various microformats⁴ potentially represent a greater threat to individual privacy.

Liu and Maes [46] point towards a lack of privacy awareness and find a large number of social network profiles in which people describe themselves using a rich vocabulary of their passions and interests. This fact strengthens the need for vulnerability research on a social networking site to make users aware of privacy risks. Krishnamurthy and Wills [42] discuss the problem of leakage of personally identifiable information and how it can be misused by third parties [56]. Ho et al. [36] discover three privacy problems on most social networking sites. First, users are not notified by social networking sites when their personal information is at risk. Second, existing privacy protection tools are not flexible enough. Finally, users cannot prevent information that may reveal private information about themselves from being uploaded by any other user. These observations validate the index estimation method proposed in Chapter 2.

1 http://openid.net/

2 http://www.dataportability.org/

3 http://bradfitz.com/social-graph-problem/

4 http://microformats.org/

There has been some research which suggests fundamental changes to social networking sites to achieve user privacy. Squicciarini et al. [69] introduce a novel collective privacy mechanism for better managing the content shared between users. Fang and LeFevre [21] focus on helping users to express simple privacy settings, but they do not consider additional problems such as attribute inference [82] or shared data ownership [69]. Zheleva and Getoor [82] show how an adversary can exploit an online social network with a mixture of public and private user profiles to predict the private attributes of users. Baden et al. [6] present a framework in which users dictate who may access their information, based on public-private encryption-decryption algorithms. Although their framework addresses privacy concerns, it comes at the cost of increased response time from a social networking site. The work proposed in this dissertation does not suggest any fundamental changes to social networking sites. I find that users can secure their privacy by unfriending vulnerable friends. Unfriending⁵ has been studied recently, but I am the first to propose unfriending to reduce the vulnerability of a user.

Psychologists have long been predicting user traits and attributes based on various types of information, such as samples of written text [22], answers to psychometric tests [14], or the appearance of places people inhabit [30]. Most of these studies are based on the assumption that users tend to inadvertently leave behind cues which correlate with their personal attributes. Recently, computer scientists have also been exploring users' personal attributes based on cues from the web, such as users' web site browsing logs [54, 37, 17, 26], contents of personal web sites [51], and music collections [64].

5 http://www.nytimes.com/2010/10/24/fashion/24Studied.html

The popularity of social media has created several opportunities for users to create data. The massive amount of social media data has attracted the attention of privacy researchers seeking to identify, measure and mitigate the risks of predicting personal attributes [39, 53, 62, 13, 60, 28, 44, 11]. Jernigan and Mistree [39] show that location within a friendship network at Facebook is predictive of sexual orientation. Mislove et al. proposed a method of inferring user attributes that is inspired by examining the normalized conductance of the existing friend lists, whose value ranges from -1 to 1, with strongly positive values indicating significant community structure. Prior studies of social network graphs have found that normalized conductance values greater than 0.2 correspond to strong communities that could be detected fairly accurately by community detection algorithms [53]. Rao et al. [63] predict gender from tweet texts alone using an N-gram-only model and hand-crafted sociolinguistic features. Conover et al. [12] proposed several methods for predicting the political alignment of Twitter users based on the content and structure of their political communication in the run-up to the 2010 U.S. midterm election. Quercia et al. [60] found a positive correlation between the number of followers/following and age. Golbeck et al. [29] used LIWC features over a sample of 167 Facebook volunteers, as well as profile information, and found limited success of a regression model in predicting the personality of a user. Li et al. [44] profile users' locations by integrating both friendship and content information in a probabilistic model.

An inspiring work from Kosinski et al. [41] shows that a wide variety of people's highly sensitive personal attributes can be automatically and accurately inferred using their Facebook likes. These personal attributes include sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age and gender. Conclusions in that work are based on Facebook users' personal data, which includes their Facebook likes, detailed demographic profiles, and results of several psychometric tests. Chapter 3 assumes that users' personal data is not available, as the data visibility can be controlled using users' profile settings. The focus of that chapter is on a novel type of data which is not only publicly available but also beyond the control of users' profile settings. The proposed framework SCOUT aims to explore personal attributes from such data so that users can secure themselves against privacy attacks. Table 4.1 summarizes the literature on prediction of private attributes. It also shows the data required for successful identification of those attributes. Social media related work is highlighted in bold.

Table 4.1: Summarization of Literature Work on Prediction of Private Attributes. Social Media Related Work is Highlighted in Bold

Private Attributes/Traits | Required Data for Prediction
Gender | Chat messages [43]; Web-pages and their browsing history [37]; Web-search query logs [40]; Linguistic features from tweets [63] and Facebook posts [61]; Web-pages' browsing history [17, 26]; Statistical features alone from Twitter [60]; Full names, profiles and contents from Twitter [9]; Statistical and content features from Twitter [3]; Facebook likes [41]; First-names in Twitter [47]
Age | Web-pages and their browsing history [37]; Web-search query logs [40]; Linguistic features from tweets [63]; Web-pages' browsing history [17, 26]; Statistical features alone from Twitter [60]; Statistical and content features from Twitter [3]; Facebook likes [41]
Location | Web-search query logs [40]; Visual, textual and temporal features from Flickr photos [16]; GPS history data [84, 83]; Linguistic features from tweets [63]; Facebook network and address [5]; Tweet content [35]; Check-in data from Gowalla and Brightkite [10]; Cell-phone location trace data [68, 10]
Relationship Status | Facebook likes [41]
Social Security Numbers | Location and date of birth [2]
Religious views | Facebook likes [41]
Political views | Music preferences [64]; Linguistic features from tweets [63]; Profile, network, statistical and content features from Twitter [59]; Twitter follower network [27]; Tweet content and communication network [13, 12]; Statistical and content features from Twitter [3]; Facebook likes [41]; Twitter retweets [80]
Sexual orientation | Facebook network [39]; Facebook likes [41]
Occupation | Web-pages' browsing history [17]
Education level | Web-pages' browsing history [17, 26]
Household Income | Web-pages' browsing history [26]
Ethnicity | Linguistic features from Facebook posts [61]; Profile, network, statistical and content features from Twitter [59]; Web-pages' browsing history [26]; Facebook likes [41]
Intelligence | Facebook likes [41]
Happiness | Cell-phone call logs [19]; Facebook likes [41]

Use  of  addictive  substance 

Facebook  likes  [41] 

Continued  on  next  page 


Table  4.1  -  Continued  from  previous  page 


Private  Attributes/Traits 

Required  Data  for  Prediction 

Parental  Separation 

Facebook  likes  [41] 

Social  ties 

Cell-phone  call  logs  [19],  Photo  and  music  related  tagging  data  from  Flickr  and 

Last.fm  [65],  Spatio-temporal  features  from  Flickr  photos  [15] 

Size  and  density  of  the  friendship 

network 

Facebook  likes  [41] 

Big  Five  Personality  Traits 

(Openness,  Conscientiousness, 

Extraversion,  Agreeableness, 

Neuroticism) 

Music  Preferences  [64],  Personal  web-pages  and  visitors’  ratings  [51],  Statistical  and  content 

features  from  Twitter  [28]  and  Facebook  profiles  [29],  Statistical  features  alone  from 

Twitter  [60],  Facebook  likes  [41] 

Chapter  5 


CONCLUSION  AND  FUTURE  WORK 


In this dissertation, I systematically studied how users can manage their
vulnerabilities on a social networking site. Based on profile settings and the amount of
available information, users are categorized into four types: (1) active users with public
settings (Q1 users), (2) active users with private settings (Q2 users), (3) inactive users
with public settings (Q3 users), and (4) inactive users with private settings (Q4 users)
(see Figure 1.1 for details). For each of these types of users, vulnerability
can be managed in three steps: (1) identifying, (2) measuring, and (3) reducing a user’s
vulnerability. The main focus of this dissertation is on active users only, i.e., Q1 and
Q2 users.
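
The four-way categorization above can be sketched in a few lines of Python. This is only an illustration: the function name `categorize_user`, the `posts_per_month` activity signal, and the `active_threshold` cut-off are assumptions introduced here, since this chapter does not state how "active" is operationalized.

```python
def categorize_user(posts_per_month: float, is_public: bool,
                    active_threshold: float = 1.0) -> str:
    """Return the quadrant label (Q1-Q4) for a user.

    Assumption: a user counts as "active" when posts_per_month is at
    least active_threshold. Both the signal and the cut-off are
    hypothetical and chosen only for illustration.
    """
    is_active = posts_per_month >= active_threshold
    if is_active:
        # Active users: public settings -> Q1, private settings -> Q2.
        return "Q1" if is_public else "Q2"
    # Inactive users: public settings -> Q3, private settings -> Q4.
    return "Q3" if is_public else "Q4"
```

Under these assumptions, `categorize_user(12, False)` labels an active user with private settings as a Q2 user, the group studied in Chapter 3.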

Chapter 2 focused entirely on managing the vulnerability of Q1 users. Since a rich
literature is available on identifying Q1 users’ vulnerabilities, this chapter provided a
novel way to measure a Q1 user’s vulnerability on a social networking site, and then
proposed a way to mitigate that vulnerability while retaining social utility value. I
proposed a feasible approach to the novel problem of identifying a user’s vulnerable
friends on a social networking site. This work differs from existing work on
social networking privacy by introducing a vulnerability-centered approach to user
security on a social networking site. On most social networking sites, privacy-related
efforts have concentrated on protecting individual attributes only. However,
users are often vulnerable through community attributes. Unfriending vulnerable
friends can help protect users against these security risks. Based on this study of over
2 million users, I found that users are either not careful about or not aware of the
security and privacy concerns of their friends. The proposed model clearly highlights
the impact of each new friend on a user’s privacy. The proposed unfriending-based
mechanism does not require structural changes to a social networking site and aims to
maximally reduce a user’s vulnerability while minimizing the loss of social utility. The
work formulated a novel problem of constrained vulnerability reduction, suggested a
feasible approach, and demonstrated that the problem is solvable.
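
To make the idea of constrained vulnerability reduction concrete, the following is a minimal sketch of a greedy heuristic. It is not the algorithm developed in Chapter 2. The function name `greedy_unfriend`, the per-friend `(vulnerability_reduction, utility_loss)` pairs, and the `utility_budget` cap are all hypothetical, and the per-friend numbers are assumed to be precomputed and independent of one another, which a real social network would not guarantee.

```python
from typing import Dict, List, Tuple


def greedy_unfriend(friends: Dict[str, Tuple[float, float]],
                    utility_budget: float) -> List[str]:
    """Pick friends to unfriend under a cap on total social utility loss.

    friends maps a friend id to (vulnerability_reduction, utility_loss);
    both values are assumed to be precomputed and non-negative.
    Greedy heuristic: consider friends in decreasing order of
    vulnerability reduction per unit of utility lost.
    """
    def ratio(item: Tuple[str, Tuple[float, float]]) -> float:
        _, (reduction, loss) = item
        # A friend whose removal costs no utility is considered first.
        return reduction / loss if loss > 0 else float("inf")

    chosen: List[str] = []
    spent = 0.0
    for friend, (reduction, loss) in sorted(friends.items(), key=ratio, reverse=True):
        if reduction <= 0:
            continue  # unfriending this friend would not reduce vulnerability
        if spent + loss <= utility_budget:
            chosen.append(friend)
            spent += loss
    return chosen
```

Because the selection resembles a knapsack problem, a greedy ratio ordering of this kind is not guaranteed to be optimal. It only illustrates the trade-off between vulnerability reduction and social utility loss.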

Chapter 3 focused on understanding the vulnerabilities of Q2 users. To the best of my
knowledge, little was known about the vulnerabilities of Q2 users before this dissertation
work. As with Q1 users, the aim of this chapter was to identify the vulnerabilities of
Q2 users, which was achieved by predicting personal attributes from publicly available,
unprotected interactions alone. I provided a way to mitigate the sparsity problem of
public interaction data with the help of social theories and proposed a novel
framework, SCOUT, to predict users’ personal attributes from public interaction data. The
evaluation of the proposed framework on real-world data from Facebook pages
showed that users’ attributes are predictable. This work paves the way for much
exciting future work, including measuring and mitigating the vulnerabilities of Q2 users.

There are many extensions that are worth further exploration. I summarize
the future work below.

5.1  User’s  Vulnerability  Across  Platforms 

Throughout this dissertation, I focused exclusively on a single social networking
platform for managing one’s vulnerability. Often a user has multiple accounts on
different social media sites and is unaware of how much information about her is
publicly available to everybody. Although such information about a user can be
challenging to gather in one place, it aids attribute prediction frameworks, thereby
making users more vulnerable.

As a preliminary study, I designed a novel web-based tool [34] for collecting
attribute values of interest associated with a particular social media user. I refer to
these attributes as Provenance Attributes and to the tool as the Provenance Data Collector.
Currently this tool is designed to assist social media users in collecting provenance data
of more than half a billion Twitter users¹ and more than a billion Facebook users².
The Provenance Data Collector³ is an online data collection tool focusing on efficiently
retrieving useful attribute values of a given Twitter or Facebook user. The tool
features an intuitive user interface and is designed to enable fast retrieval of a
maximum number of desired provenance attributes. If some desired provenance attributes
are uncertain, the tool provides the best possible URL (Uniform Resource Locator) to
help users further their findings. In addition to provenance attributes, the tool also
presents other attribute values and related images found during the search, as well as
measures to evaluate the efficiency of the system. Figure 5.1 shows an overview of the
tool for collecting provenance attribute values. User vulnerability research can be
expanded by examining the impact of such tools on a user’s vulnerability. I think protecting
against vulnerabilities arising from multiple platforms is a big challenge.

Figure 5.1: Web Interface of the Provenance Data Collector Tool Showing Attribute
Values of President Barack Obama (@barackobama)

¹ http://en.wikipedia.org/wiki/Twitter
² http://en.wikipedia.org/wiki/Facebook
³ The Provenance Data Collector tool is located at http://blogtrackers.fulton.asu.edu/Prov_Attr,
and a demonstration video can be found at http://www.screencast.com/t/XujEYbBXBKBd

5.2  Exploring  Inactive  Users  with  Private  Settings

The next challenge for vulnerability research is to identify the vulnerability of inactive
users with private settings. There are two ways to explore this problem. The first
is to exploit the unique features in usernames alone to identify a vulnerability,
whereas the other is to obtain more information about inactive users across social
media sites. Recent research has shown that users are often more active on one social
networking site than another [67]. Also, usernames alone have been used to connect
unique users across social media sites [81].

5.3  Identifying  Passive  Attackers 

Using the proposed vulnerability work, one can identify friends whose actions can
potentially compromise a user’s privacy. Let us assume that a user is doing her utmost by
effectively employing the profile settings to keep her personal data private. However,
there still exist passive attackers who can breach a user’s privacy. Passive privacy
attackers are those who simply want to harvest real friends to increase their credibility
and use them later to target one’s privacy. There are three possible cases in which
a passive attacker can compromise a user’s privacy: (1) a passive attacker becomes a
friend of a user; (2) a passive attacker becomes a friend of a user’s friend; and
(3) a passive attacker is outside of a user’s 2-hop network.

In the first case, a user is completely compromised, as she trusts a passive attacker
by becoming a friend. This case further strengthens the experimental finding that less
vulnerable users can become more vulnerable if they are careless when making new
friends. The proposed vulnerability approach can identify such attackers only if
they are involved in exposing a user’s private information to others. However,
proactive approaches need to be investigated and designed to mitigate the vulnerability
arising from a passive attacker who is assumed to be an online but not a real-world
friend of a user. Currently, in such a situation, a user can form social circles of friends
based on ties in the online and real worlds. A user can then selectively share
sensitive information among her social circles. Another approach to the first
case is to investigate the behavioral patterns of friends in order to automatically detect
passive privacy attackers. However, this approach will be ineffective if a passive
attacker exhibits behavioral patterns similar to those of friends who minimally
expose others.

The second case often arises when a user’s friend is less careful in establishing
a new friend connection. The proposed approach is more likely to identify such
a vulnerable friend, as she is less careful about her own as well as her friends’ privacy. A
user can also notify her vulnerable friends about potential risks and request them to
take the necessary actions. In this case, the risk of an attacker accessing a user’s private
information is lower than in the first case. From the privacy perspective, the third case
is the best possible scenario of the three. Using the proposed
approach, a user can periodically measure her vulnerability. If a user observes a rise in
vulnerability, she can notify her friends about the recent actions that caused the increase.
This approach can help a user keep passive attackers from
infiltrating her 2-hop network.
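
The periodic check described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `vulnerability_rises` and the `min_rise` tolerance are hypothetical, and the scores are assumed to come from repeated applications of the vulnerability measure at successive points in time.

```python
from typing import List, Sequence


def vulnerability_rises(scores: Sequence[float], min_rise: float = 0.0) -> List[int]:
    """Return the indices at which vulnerability rose by more than min_rise.

    scores is a chronological series of a user's vulnerability values;
    index i is reported when scores[i] - scores[i - 1] > min_rise.
    The min_rise tolerance is an illustrative assumption.
    """
    return [i for i in range(1, len(scores))
            if scores[i] - scores[i - 1] > min_rise]
```

Each reported index marks a measurement period whose recent friend activity a user might want to review before notifying the friends involved.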

Social networking sites like Facebook and Google+ can also help in detecting
such passive privacy attackers by analyzing their browsing (clicking) behavior [77].
The assumption here is that passive attackers are goal-oriented and exhibit browsing
patterns distinct from those of normal users.

5.4  Extending  User  Vulnerability  to  Cope  with  Identity  Theft 

The proposed definition of a user’s vulnerability is based on the visibility and
exposure of the user’s profile through not only attribute settings but also the user’s friends. I
measure a user’s vulnerability using three factors: (1) the user’s privacy settings, which can
reveal personal information; (2) the user’s actions on a social networking site that can
expose their friends’ personal information; and (3) friends’ actions on a social
networking site that can reveal the user’s personal information. In other words, the proposed
vulnerability measure quantifies how much privacy risk a user faces due to her own and
her friends’ actions.

Besides the actions of a user and her friends, a user is also vulnerable to other types
of privacy attacks such as identity theft or cloning. If a cloned user is involved in
exposing a normal user’s profile, then the proposed vulnerability measure can
point to the origin of such an attack. Otherwise, the proposed solution cannot
identify a cloned user. However, a social networking site like Facebook
can play a significant role in detecting cloned users, as they are expected to have
different browsing (click) activities [77] from those of normal users. Hence,
further investigation is needed to extend existing user vulnerability approaches to handle
identity theft or cloning attacks.

5.5  Measuring  and  Reducing  Vulnerability  of  Active  Users  with  Private  Settings 

In Chapter 3, I successfully demonstrated a method to predict personal attributes
of active users with private settings. This paves the way for the next important
task of measuring and reducing the vulnerabilities of such users. One way of reducing
such vulnerabilities is to give users more control over
their profile settings. However, such profile settings require social networking sites
to change their existing architecture and also limit users’ social behavior. The other
way of reducing vulnerabilities is to diversify a user’s activities so that the framework
proposed in Chapter 3 cannot extract patterns and fails to predict personal
attributes.